ON A METHOD OF TESTING THE HYPOTHESIS THAT AN OBSERVED
SAMPLE OF n VARIABLES AND OF SIZE N HAS BEEN
DRAWN FROM A SPECIFIED POPULATION OF THE
SAME NUMBER OF VARIABLES


By JOHN W. FERTIG


WITH THE TECHNICAL ASSISTANCE OF MARGARET V. LEARY* 


The problem of determining whether or not a given observation may be 
regarded as randomly drawn from a certain population completely specified with 
respect to its parameters is readily solved if the probability integral of that 
population be known. In particular if the population specified be a normal
population, one may calculate the relative deviate $(x - a)/\sigma$, where $a$ and $\sigma$
are the population mean and standard deviation respectively, and refer to tables
of the normal probability integral. The hypothesis that $x$ was drawn from
this population may be rejected if $P$ is less than an arbitrarily fixed value,
say < .01. Generalizations of this problem may be made in two directions:
1) May a single observation simultaneously made on n variables be considered 
as randomly drawn from a specified population of n variables? 2) May a 
sample of one variable and of size N be regarded in its entirety as randomly 
drawn from a specified univariate population? 

The solution to the first problem for the case of sampling from a normal
population of n variables was given by Karl Pearson in 1908¹ as the "Generalized
Probable Error." Let

$$\chi^2 = \frac{1}{P} \sum_{i,j=1}^{n} P_{ij}\, \frac{(x_i - a_i)(x_j - a_j)}{\sigma_i \sigma_j},$$

where $a_i$ and $\sigma_i$ are the population mean and standard deviation respectively
of the $i$th variable, and $P_{ij}$ is the usual cofactor of the element in the $i$th row
and $j$th column of the determinant $P$ of population correlation coefficients.
That is,

$$P = |\rho_{ij}|; \quad i, j = 1, 2, 3, \ldots, n.$$

The probability of an observation yielding a smaller discrepancy than that
represented by the value of $\chi^2$, i.e., lying between 0 and $\chi^2$, may then be
calculated from Tables of the Incomplete Normal Moment Functions². The
tables are entered in terms of $(\chi^2)^{1/2}$ and $(n - 1)$, and the tabled value multiplied
by $(2\pi)^{1/2}$ or 2 depending upon whether n be even or odd respectively.


*From the Memorial Foundation for Neuro-Endocrine Research and the Research 
Service of the Worcester State Hospital, Worcester, Massachusetts.

The probability of an observation giving a greater discrepancy is then the
complement of this value. Obviously, this latter probability may be obtained
directly by entering tables of the $\chi^2$ distribution such as Elderton's³ with n
degrees of freedom, or through the use of Tables of the Incomplete Γ-Function⁴.
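
As an illustration of the generalized $\chi^2$ for a single observation, the following minimal Python sketch treats only the case n = 2, for which $P = 1 - \rho^2$, $P_{11} = P_{22} = 1$, $P_{12} = -\rho$, and the $\chi^2$ tail probability with two degrees of freedom is simply $e^{-\chi^2/2}$, so that no tables are needed. The function names and the sample figures are hypothetical and are not taken from the paper.

```python
import math

def generalized_chi2_bivariate(x, a, sigma, rho):
    """Generalized chi-square for one observation on two correlated normal variables."""
    z1 = (x[0] - a[0]) / sigma[0]
    z2 = (x[1] - a[1]) / sigma[1]
    # For n = 2 the cofactor form reduces to the familiar bivariate quadratic form.
    return (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)

def prob_greater_discrepancy(chi2):
    """Upper tail of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-chi2 / 2.0)

# Hypothetical observation: one standard deviation above the mean on both variables.
chi2 = generalized_chi2_bivariate(x=(1.0, 1.0), a=(0.0, 0.0), sigma=(1.0, 1.0), rho=0.0)
p_greater = prob_greater_discrepancy(chi2)
```

For other values of n the tail probability has to come from a $\chi^2$ table or routine with n degrees of freedom; the closed form used here holds for n = 2 only.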

The second problem, limited to the case of sampling from a normal population,
was investigated by J. Neyman and E. S. Pearson in 1928⁵. The observed
sample may be regarded as a point in N-dimensional space, where N is the sample 
size. Criteria for the acceptance or rejection of the hypothesis may be asso- 
ciated with contour surfaces in this space, so that in moving out from contour 
to contour the hypothesis becomes less and less reasonable. Frequently, con- 
tour surfaces on which the mean or standard deviation is constant are used for 
the testing of this hypothesis. Such surfaces are deficient inasmuch as they 
are not “closed” contours. Another contour system which appears more satis- 
factory is that of equiprobable pairs of m and s. The latter system in fact 
encloses roughly the same region as do the separate contours for the means and 
standard deviations. These systems are of course dependent on the particular 
statistics chosen to describe the sample and are further limited in that they do 
not take into account the probability of alternative hypotheses concerning the 
origin of the sample. 

Using the principle of maximum likelihood Neyman and Pearson have developed
a system of contours which is free of the above limitations. The system
so derived is in fact quite similar to that of equiprobable pairs m and s. In a
later paper⁶, these same investigators have shown that this method of maximum
likelihood does enable one to select the most efficient criteria for the testing of
an hypothesis. The criterion selected on this basis is defined as

$$\lambda = \frac{\text{Likelihood that sample came from specified population}}{\text{Maximum likelihood that sample came from some other population}} = \left(\frac{s^2}{\sigma^2}\right)^{N/2} e^{-\frac{N}{2}\left[\frac{(\bar{x} - a)^2 + s^2}{\sigma^2} - 1\right]},$$

where $a$ and $\sigma$ are the population mean and standard deviation respectively,
and $\bar{x}$ and $s$ the sample mean and standard deviation.

$\lambda$ is constant upon certain contour surfaces in N-dimensional space, and diminishes
on passing outward. The form of the surfaces is independent of N. It
is evident that $\lambda$ must lie between zero and unity. When it is close to unity
we know that it is reasonable to assume that our hypothesis is true, when
small we know that it is unreasonable. But we must know the probability of
$\lambda$ less than a certain value occurring when the hypothesis tested is true, so that
we may control another source of error, namely, that of rejecting the hypothesis
when it is true. In other words, we must know the sampling distribution of $\lambda$,
so that we will reject the hypothesis only when the probability of obtaining a
smaller value is negligible, say $P_\lambda < .01$. Neyman and Pearson were not able
to evaluate this distribution but they were able to integrate the original density
function of the population appropriate to N-dimensional space outside of the
various $\lambda$ contours. This they were able to do by effecting a transformation
of the density function and contours to the plane of m and s. These values
of $P_\lambda$ have been tabled by them⁷, the tables being entered in terms of N and
k, where

$$k = \frac{(\bar{x} - a)^2 + s^2}{\sigma^2}\, \log_{10} e - \log_{10} \frac{s^2}{\sigma^2}.$$
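
The univariate criterion and the tabular argument k can be computed directly. The sketch below is illustrative only: the function names are hypothetical, and the form of k (common logarithms) is an assumption, adopted because it reproduces the figures quoted later in the paper for $(\bar{x} - a)/\sigma = 0.2$, $s/\sigma = 1.2$ and a sample of size 10.

```python
import math

def univariate_lambda(xbar, s, a, sigma, N):
    """Likelihood-ratio criterion for one variable; s is the sample S.D. with divisor N."""
    ratio = (s / sigma) ** 2          # s^2 / sigma^2
    dev = ((xbar - a) / sigma) ** 2   # squared standardized deviation of the mean
    log_lam = (N / 2.0) * (math.log(ratio) - (dev + ratio - 1.0))
    return math.exp(log_lam)

def tabular_k(xbar, s, a, sigma):
    """Argument k for the Neyman-Pearson tables (assumed form, common logarithms)."""
    ratio = (s / sigma) ** 2
    dev = ((xbar - a) / sigma) ** 2
    return (dev + ratio) * math.log10(math.e) - math.log10(ratio)

# Values taken from the paper's own numerical check.
lam = univariate_lambda(xbar=0.2, s=1.2, a=0.0, sigma=1.0, N=10)
root = lam ** (1.0 / 10)              # the N-th root of lambda
k = tabular_k(xbar=0.2, s=1.2, a=0.0, sigma=1.0)
```

With these inputs `root` and `k` should agree with the paper's .94395 and 0.48439 to the figures given; note also that $\lambda = 10^{-(N/2)(k - \log_{10} e)}$, so k and N together determine $\lambda$.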


The generalization of either of the above problems requires a criterion to
test an hypothesis which may be formulated as follows: Given a sample $\Sigma$ of n
variables and of size N with means $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$, standard deviations
$s_1, s_2, \ldots, s_n$, and correlation coefficients $r_{12}, r_{13}, \ldots, r_{1n}, r_{23}, \ldots, r_{2n}, \ldots, r_{n-1,n}$,
may we regard this sample as randomly drawn from a population $\pi$ of n variables
and completely specified with respect to all its parameters? We shall
restrict our inquiries to the case where $\pi$ is a normal population. In this case
the distribution law is

$$f(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2} \sigma_1 \sigma_2 \cdots \sigma_n P^{1/2}} \exp\left\{-\frac{1}{2P} \sum_{i,j=1}^{n} P_{ij}\, \frac{(x_i - a_i)(x_j - a_j)}{\sigma_i \sigma_j}\right\},$$

where $a_i$ and $\sigma_i$ are the population mean and standard deviation respectively
of the $i$th variable, and $P$ and $P_{ij}$ are as previously defined.

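Thus the probability that $\Sigma$ has been drawn from $\pi$ with its N values of
$x_{i\alpha}$ $(i = 1, 2, \ldots, n)$ lying in the interval $x_{i\alpha} \pm \tfrac{1}{2}\, dx_{i\alpha}$ $(\alpha = 1, 2, \ldots, N)$ is
given by

$$C = \left[\frac{1}{(2\pi)^{n/2} \sigma_1 \sigma_2 \cdots \sigma_n P^{1/2}}\right]^{N} e^{-\theta}\, dX,$$

where

$$\theta = \frac{1}{2P} \sum_{i,j=1}^{n} P_{ij} \sum_{\alpha=1}^{N} \frac{(x_{i\alpha} - a_i)(x_{j\alpha} - a_j)}{\sigma_i \sigma_j} = \frac{N}{2P} \sum_{i,j=1}^{n} P_{ij}\, \frac{r_{ij} s_i s_j + (\bar{x}_i - a_i)(\bar{x}_j - a_j)}{\sigma_i \sigma_j},$$

$$dX = \prod_{i=1}^{n} \prod_{\alpha=1}^{N} dx_{i\alpha}.$$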


The likelihood that $\Sigma$ has been drawn from any other normal population,
such as $\pi'$, is given by

$$C' = \left[\frac{1}{(2\pi)^{n/2} \sigma'_1 \sigma'_2 \cdots \sigma'_n P'^{1/2}}\right]^{N} e^{-\theta'}\, dX,$$

where

$$\theta' = \frac{N}{2P'} \sum_{i,j=1}^{n} P'_{ij}\, \frac{r_{ij} s_i s_j + (\bar{x}_i - a'_i)(\bar{x}_j - a'_j)}{\sigma'_i \sigma'_j}.$$


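The population from which it is most likely that $\Sigma$ has been drawn is that
for which $C'$ is a maximum. The values of the parameters of this population
may be obtained by putting

$$\frac{\partial C'}{\partial a'_i} = 0, \quad \frac{\partial C'}{\partial \sigma'_i} = 0 \quad (i = 1, 2, \ldots, n); \qquad \frac{\partial C'}{\partial \rho'_{ij}} = 0 \quad (i, j = 1, 2, \ldots, n).$$

These conditions are fulfilled when

$$a'_i = \bar{x}_i, \quad \sigma'_i = s_i, \quad \rho'_{ij} = r_{ij}.$$

So that

$$C'(\text{max.}) = \left[\frac{1}{(2\pi)^{n/2} s_1 s_2 \cdots s_n R^{1/2}}\right]^{N} e^{-nN/2}\, dX,$$

where

$$R = |r_{ij}|; \quad i, j = 1, 2, \ldots, n.$$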


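The appropriate criterion to select in order to test our hypothesis is thus

$$\lambda = \frac{C}{C'(\text{max.})} = \left[\frac{s_1^2 s_2^2 \cdots s_n^2\, R}{\sigma_1^2 \sigma_2^2 \cdots \sigma_n^2\, P}\right]^{N/2} e^{-w},$$

where

$$w = \frac{N}{2} \left[\frac{1}{P} \sum_{i,j=1}^{n} P_{ij}\, \frac{r_{ij} s_i s_j + (\bar{x}_i - a_i)(\bar{x}_j - a_j)}{\sigma_i \sigma_j} - n\right].$$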
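
The generalized criterion can be evaluated numerically without forming cofactors one at a time, since $P_{ij}/P$ is the $(i, j)$ element of the inverse of the population correlation matrix. The following is a minimal sketch under that observation; the function names and the small elimination routine are hypothetical helpers rather than anything given in the paper, and the standard deviations are taken with divisor N.

```python
import math

def _inverse_and_logdet(matrix):
    """Gauss-Jordan inverse and log-determinant of a positive-definite matrix."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    logdet = 0.0
    for col in range(n):
        pivot = aug[col][col]            # positive when the matrix is positive-definite
        logdet += math.log(pivot)
        aug[col] = [v / pivot for v in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [row[n:] for row in aug], logdet

def log_lambda(N, xbar, s, r, a, sigma, rho):
    """Natural logarithm of the generalized lambda criterion."""
    n = len(xbar)
    rho_inv, logdet_rho = _inverse_and_logdet(rho)   # rho_inv[i][j] = P_ij / P
    _, logdet_r = _inverse_and_logdet(r)             # log of the sample determinant R
    total = 0.0
    for i in range(n):
        for j in range(n):
            cross = r[i][j] * s[i] * s[j] + (xbar[i] - a[i]) * (xbar[j] - a[j])
            total += rho_inv[i][j] * cross / (sigma[i] * sigma[j])
    w = (N / 2.0) * (total - n)
    log_ratio = sum(2.0 * math.log(s[i] / sigma[i]) for i in range(n))
    log_ratio += logdet_r - logdet_rho
    return (N / 2.0) * log_ratio - w

# With n = 1 the function reduces to the univariate criterion.
univariate_check = math.exp(log_lambda(10, [0.2], [1.2], [[1.0]], [0.0], [1.0], [[1.0]]) / 10)
```

Working with $\log \lambda$ rather than $\lambda$ avoids underflow for large N; when the sample statistics coincide with the population parameters the function returns zero, that is, $\lambda = 1$.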


The equations $\lambda$ = constant represent a series of contours in N-dimensional
space. As we move outward from contour to contour our hypothesis becomes
less and less acceptable. Although we may be confident that the use of this
criterion will minimize the chance of accepting the hypothesis when it is false
we must know the frequency with which samples occur outside of a given $\lambda$
contour when the hypothesis is true. In other words, we must know the integral
of C outside of various contours, or else we must know the sampling distribution
of $\lambda$. The former is an exceedingly difficult method for n greater than
unity. Thus for the case of n = 2 we should have to integrate some such
expression as

$$k\, s_1^{N-2} s_2^{N-2}\, e^{-\theta}\, (1 - r_{12}^2)^{\frac{N-4}{2}}\, d\bar{x}_1\, d\bar{x}_2\, ds_1\, ds_2\, dr_{12}$$

outside of the various contours. Nor have we so far been able to evaluate the
sampling distribution. We can however give an expression for the moments
of $\lambda$ and thus reach an approximate distribution.

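Wilks⁸ has derived expressions for the moment coefficients about zero for
the maximum likelihood criterion that k samples of n variables and of $N_t$
observations each have been drawn from the same unspecified normal population
of n variables. Thus,

$$\mu'_h(\lambda) = \frac{N^{\,nhN/2}}{\prod_{t=1}^{k} N_t^{\,nhN_t/2}} \prod_{i=1}^{n} \left\{ \frac{\Gamma\left(\frac{N - i}{2}\right)}{\Gamma\left(\frac{N(1+h) - i}{2}\right)} \prod_{t=1}^{k} \frac{\Gamma\left(\frac{N_t(1+h) - i}{2}\right)}{\Gamma\left(\frac{N_t - i}{2}\right)} \right\}, \qquad N = \sum_{t=1}^{k} N_t,$$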

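from which we can write expressions giving the moment coefficients about zero
for the $\lambda$ criterion for two samples

$$\mu'_h(\lambda) = \frac{(N_1 + N_2)^{\,nh(N_1+N_2)/2}}{N_1^{\,nhN_1/2}\, N_2^{\,nhN_2/2}} \prod_{i=1}^{n} \frac{\Gamma\left(\frac{N_1(1+h) - i}{2}\right) \Gamma\left(\frac{N_2(1+h) - i}{2}\right) \Gamma\left(\frac{N_1 + N_2 - i}{2}\right)}{\Gamma\left(\frac{N_1 - i}{2}\right) \Gamma\left(\frac{N_2 - i}{2}\right) \Gamma\left(\frac{(N_1 + N_2)(1+h) - i}{2}\right)}.$$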


The limit of this latter expression as $N_2 \to \infty$ will be the moment coefficient
about zero for the $\lambda$ criterion that one sample has been drawn from a specified
population. Thus

$$\lim_{N_2 \to \infty} \mu'_h(\lambda) = \left(\frac{2e}{N_1}\right)^{nhN_1/2} (1 + h)^{-nN_1(1+h)/2} \prod_{i=1}^{n} \frac{\Gamma\left(\frac{N_1(1+h) - i}{2}\right)}{\Gamma\left(\frac{N_1 - i}{2}\right)}.$$


Various roots of $\lambda$ are distributed to a good degree of approximation according
to a function of the form

$$f(x) = \frac{\Gamma(m_1 + m_2)}{\Gamma(m_1)\, \Gamma(m_2)}\, x^{m_1 - 1} (1 - x)^{m_2 - 1},$$

where

$$m_1 = \mu'_1(\mu'_1 - \mu'_2)/(\mu'_2 - \mu'^{\,2}_1); \qquad m_2 = (1 - \mu'_1)\, m_1/\mu'_1,$$

and the value of $\mu'_h$ for roots of $\lambda$ may be obtained by replacing h in the original
expression by h times the desired root. Measures of the skewness and kurtosis
of this distribution are given by


$$\beta_1 = \frac{4 (m_1 - m_2)^2 (m_1 + m_2 + 1)}{m_1 m_2 (m_1 + m_2 + 2)^2}$$

$$\beta_2 = \frac{3 \beta_1 (m_1 + m_2 + 2) + 6 (m_1 + m_2 + 1)}{2 (m_1 + m_2 + 3)}$$

A comparison with the true measures of skewness and kurtosis for various roots
of $\lambda$ as given by

$$\beta_1 = \mu_3^2/\mu_2^3; \qquad \beta_2 = \mu_4/\mu_2^2$$

will afford a measure of the goodness of the approximation and the range of
values of N for which any particular root will be distributed as assumed.

Investigating the moments for n from one to four and N from three to fifty
we note that in the case of samples of two and three variables, $\lambda^{1/N}$ follows the
assumed distribution for N from 3 to 15; $\lambda^{2/N}$ from 15 to 30; a root of the form
$\lambda^{c/N}$ (numerator c illegible) from 30 to 50.
In the case of four variables, $\lambda^{1/2N}$ follows the distribution for N from 5 to 10;
$\lambda^{1/N}$ from 10 to 20; $\lambda^{2/N}$ from 20 to 40; a root of the form $\lambda^{c/N}$ (numerator c
illegible) from 40 to 50. It appears likely that
for higher values of n, for N small, some such root as $\lambda^{1/2N}$ or $\lambda^{1/3N}$ will follow the
assumed distribution, while as N increases smaller roots will follow it. For
any value of n, the smallest permissible value of N is (n + 1).

The probability that a smaller value of $\lambda$ will be obtained when the sample
has actually been drawn from $\pi$, i.e., $P_\lambda$, may thus be obtained by reference to
Tables of the Incomplete B-Function⁹ with $p = m_1$, $q = m_2$, $x$ = value of the
particular root of the observed $\lambda$. We may also get the 1% and 5% levels of
significance directly from Fisher's¹⁰ tables of "z" or Snedecor's¹¹ tables of "F"
($= e^{2z}$), by taking

$$n_1 = 2 m_2; \qquad n_2 = 2 m_1; \qquad L = n_2/(n_2 + n_1 F),$$

where L is the desired root of $\lambda$. Linear interpolation will generally suffice
except for very small values of N.
For the case of $N \to \infty$, we have

$$\lim_{N \to \infty} \mu'_h(\lambda) = (1 + h)^{-\frac{1}{2} \sum_{i=2}^{n+1} i}.$$

Thus the quantity $(-2 \log \lambda)$ will be distributed in the $\chi^2$ distribution with
$\sum_{i=2}^{n+1} i$ degrees of freedom.
A table of the 1% and 5% levels of significance for n equal one to four, and
values of N from five to $\infty$ is given below.

5% and 1% Levels of Significance of "$\lambda$"

[Table entered by n and N; the tabulated entries are illegible here.]


A check on the accuracy of the method of approximation used may be obtained
by comparing the values of $P_\lambda$ for the case of n = 1 with the exact values given
by Neyman and Pearson. For N = 10, $\lambda^{1/N}$ is distributed as assumed with
$m_1 = 9.0562$, $m_2 = 0.9987$. For the case of $(\bar{x} - a)/\sigma = 0.2$, $s/\sigma = 1.2$, we
find k = 0.48439, $\lambda^{1/N} = .94395$. From the Tables of the Incomplete B-Function
we find $P_\lambda = .5936$, from Neyman and Pearson's tables, .5935.
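
The check just described can be retraced numerically. The sketch below is an illustration under stated assumptions, not the authors' computation: it assumes the one-sample moment expression $\mu'_h(\lambda) = (2e/N)^{nhN/2} (1+h)^{-nN(1+h)/2} \prod_{i=1}^{n} \Gamma\left(\frac{N(1+h) - i}{2}\right) / \Gamma\left(\frac{N - i}{2}\right)$, fits $m_1$ and $m_2$ to $\lambda^{1/N}$ by the first two moments, and evaluates the incomplete B-function ratio by Simpson's rule in place of the tables. All function names are hypothetical.

```python
import math

def log_moment(h, n, N):
    """Natural log of the h-th moment about zero of lambda (assumed one-sample form)."""
    total = (n * h * N / 2.0) * math.log(2.0 * math.e / N)
    total -= (n * N * (1.0 + h) / 2.0) * math.log(1.0 + h)
    for i in range(1, n + 1):
        total += math.lgamma((N * (1.0 + h) - i) / 2.0) - math.lgamma((N - i) / 2.0)
    return total

def beta_fit(root, n, N):
    """Fit m1, m2 to lambda**root from its first two moments (h replaced by h * root)."""
    mu1 = math.exp(log_moment(root, n, N))
    mu2 = math.exp(log_moment(2.0 * root, n, N))
    m1 = mu1 * (mu1 - mu2) / (mu2 - mu1 ** 2)
    m2 = (1.0 - mu1) * m1 / mu1
    return m1, m2

def incomplete_beta_ratio(x, p, q, steps=20000):
    """I_x(p, q) by Simpson's rule; intended for 0 <= x < 1 and p >= 1."""
    width = x / steps

    def density(t):
        return t ** (p - 1.0) * (1.0 - t) ** (q - 1.0)

    total = density(0.0) + density(x)
    for j in range(1, steps):
        total += (4.0 if j % 2 else 2.0) * density(j * width)
    log_beta = math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    return total * width / 3.0 / math.exp(log_beta)

m1, m2 = beta_fit(root=1.0 / 10, n=1, N=10)
p_lambda = incomplete_beta_ratio(0.94395, m1, m2)
```

Run as written, `m1` and `m2` should fall close to the quoted 9.0562 and 0.9987, and `p_lambda` within a few units in the third decimal of the tabular .5936; small differences are to be expected from table interpolation in the original work.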

No studies have been made on the extent of deviation from normality permissible
for the application of the test. There is no reason to doubt, however,
that as much deviation is permissible as in the case of the univariate $\lambda$. From
theoretical considerations and from sampling studies Neyman and Pearson conclude
that the univariate $\lambda$ technique holds for deviation from normality to
the extent of +0.5 for $\beta_1$ and 2.5 to 4.2 for $\beta_2$.

We are confident that this generalized $\lambda$ technique will be found useful in
biological research. If the n variables were uncorrelated we would be able to
test whether the sample had been drawn from the population of n variables by
successive applications of the univariate $\lambda$ technique and then combining the
resulting probabilities. In general, however, there will be some correlation
between the variables, however slight. The method here proposed will take
account of all possible intercorrelations, and consequently all multiple and
partial correlations.

Now, if $P_\lambda$ is less than some arbitrarily fixed value, say ≤ .01, we may decide
which variable or variables contributes most to this result, by performing
simpler $\lambda$ tests. It may be due to one or more of the means, standard deviations,
or correlation coefficients. As may often be the case, it is not due to any one
factor but to contributions from all of them. That is, all possible factors
tested separately might show a fairly reasonable value of P, but if all the
separate values are combined somehow, as by means of this $\lambda$ method, the
resultant P may be too small. It is in such problems that this technique should
provide valuable information.

In case k samples of n variables are available it should be possible to determine
whether all of them have come from the same specified population of n
variables by performing k $\lambda$ tests and combining the separate values of $P_\lambda$.
Such a hypothesis may best be tested, however, by a further extension of the $\lambda$
theory which the writers are at present investigating.

The following problem is chosen to illustrate the computations involved in
the application of the test. Many of the investigations pursued at the Worcester
State Hospital attempt to differentiate between schizophrenic patients
and normal controls. In one such type of investigation various blood constituents
were determined, namely, Urea N₂ (mg./100 cc.), Uric Acid N₂
(mg./100 cc.), Creatine N₂ (mg./100 cc.) for a sample of twenty-five schizophrenic
patients. Previous investigations on these same variables for a large
series of normal controls yielded constants which for the purpose of the
example may be considered as the population parameters. Past studies on
these variables have not shown any marked degree of non-normality for the
various distributions.

These variables are designated as

1 = Urea N₂; 2 = Uric Acid N₂; 3 = Creatine N₂

The parameters of the population are given by

$a_1 = 16.03$; $a_2 = 1.40$; $a_3 = 1.25$
$\sigma_1^2 = 20.268$; $\sigma_2^2 = 0.029$; $\sigma_3^2 = 0.025$
$\rho_{12} = .3075$; $\rho_{13} = .1232$; $\rho_{23} = .3853$

The statistics for the sample of twenty-five are

$\bar{x}_1 = 15.56$; $\bar{x}_2 = 1.42$; $\bar{x}_3 = 1.25$
$s_1^2 = 10.486$; $s_2^2 = 0.043$; $s_3^2 = 0.025$
$r_{12} = -.0161$; $r_{13} = .0925$; $r_{23} = .2174$


None of these statistics differs significantly from the corresponding parameters.

$R = 0.9448$; $P = 0.7710$;
$P_{12}/P = -0.3373$; $P_{13}/P = -0.0061$; $P_{23}/P = -0.4506$;
$P_{11}/P = 1.1045$; $P_{22}/P = 1.2773$; $P_{33}/P = 1.1744$
$w = 12.5\,(0.3802) = 4.7531$
$(s_1^2 s_2^2 s_3^2 R/\sigma_1^2 \sigma_2^2 \sigma_3^2 P) = 0.9001$
$\log \lambda = 12.5 \log (0.9001) - 4.7531 \log e = \bar{3}.3641$ (that is, $-2.6359$)
$\lambda = .0023$

Since the 5% level of significance is about .0001, we thus conclude that the
patients are not differentiated from the control population with respect to these
variables.


ON CONFIDENCE RANGES FOR THE MEDIAN AND OTHER
EXPECTATION DISTRIBUTIONS FOR POPULATIONS OF
UNKNOWN DISTRIBUTION FORM


By WILLIAM R. THOMPSON


About the commonest situation with which we are confronted in mathematical 
statistics is that where we have a sample of n observations, $\{x_i\}$, which is
assumed to have been drawn at random from an unknown population, U, with a 
zero probability that any two values in the finite sample be equal; and we 
desire to obtain from this evidence some insight as to parameters of the parent 
population, U. If further assumptions are made as to some of the parameters 
or the form of U, there may result a gain in power in testing other given hy- 
potheses or establishing confidence ranges for particular parameters, but at an 
obvious sacrifice of scope in application. Insistent problems involve estimation 
of mathematical expectation that in further sampling we shall find x lying within 
a given interval, or similar expectation with regard to parameters of U such as 
the unknown median. It might seem that, without further assumption, all 
we should claim is that it is possible to draw from U the sample actually ob- 
served. A mere description of the experience may well be considered the 
observer’s first duty, but a restriction to this would leave entirely unused the 
quality of randomness which has been assumed. What additional statements 
as to U may be appropriate in view of this randomness are our immediate 
concern; and the object of the present communication is to show how we may 
obtain such expressions in the form of mathematical expectations, and to 
present some results. Widespread applications to problems of estimation of 
normal ranges of variation or specific confidence ranges and comparisons of 
sample reflections of possibly different populations are immediately suggested, 
and a new foundation is offered for the study of frequency-distribution from the 
point of view of Schmidt.¹


Section 1 


Accordingly, consider the following situation. Let $A = \{x\}$ denote the set of
all real numbers; and U denote an unknown frequency-distribution law of draft
from $\{x\}$ such that there exists an unknown function, $f(x)$, bounded and not
negative in A, and that the probability of obtaining x in an arbitrary interval
$(\alpha, \beta)$ is

$$P(\alpha < x < \beta) = \int_{\alpha}^{\beta} f(x)\, dx; \tag{1}$$

and, for every positive $p < 1$, there exists a finite interval $(\alpha, \beta)$ such that
$P(\alpha < x < \beta) > p$. Let U be called an infinite population; and let n drafts,
independently thus governed, made from A without replacements be called a
random sample of n observations from U. Let $S = \{x_k\}$, $k = 1, \ldots, n$, denote
such a sample; the enumeration to be made in an arbitrarily determined manner.
In any case $x_i \neq x_j$ for $i \neq j$.

¹ Schmidt, R., Annals of Math. Stat., 6, 30, (1934).

Temporarily, let us consider k to indicate the order of draft of the values of
$\{x_k\}$, and let $p_k = P(x < x_k)$ denote the probability that x, drawn at random
from U, be less than $x_k$ of S. The probability a priori (i.e., without regard to
relative values of x in the sample) that in such random sampling $p_k$ lie between
$p'$ and $p''$, where $0 \leq p' < p'' \leq 1$, is obviously independent of k, and
equals $p'' - p'$; i.e., $p_k$ is equally likely a priori to lie in either of any two equal
intervals in its possible range, (0, 1). Furthermore, the probability that in
the rest of the sample, S, there will be just r values less than $x_k$ is

$$\binom{n-1}{r}\, p_k^{\,r}\, (1 - p_k)^{n-1-r},$$

where r is an integer and $0 \leq r < n$. Of course, $p_k$ is unknown; but we may
calculate (for all cases in repeated sampling wherein the same value of r is
encountered) the expectation, $P_r(p' < p_k < p'')$, that $p_k$ lie in the interval
$(p', p'')$. This is given by

$$P_r(p' < p_k < p'') = \frac{n!}{r!\, s!} \int_{p'}^{p''} p^{r} q^{s}\, dp, \tag{2}$$

where $s = n - 1 - r$, and $q = 1 - p$. This is a familiar result²,³,⁴ in applications
of the well-known principle of Bayes to estimation of a posteriori probability.
The approach is convenient in that many relations which have been developed
in this connection are made immediately available. However, that $p_k$ is
equally likely a priori to lie in either of any two equal intervals in its possible
range, is not based in the present case upon an especially added assumption
nor any plea concerning equal distribution of ignorance, but follows directly from
the elementary assumptions of random sampling. Accordingly, we are enabled
to develop for given ranges what may be called the specific confidence or mathematical
expectation that a given variable lie therein.

Obviously, (2) does not depend on k if this index is the order of draft provided
that just r values of the sample, S, are less than the one under consideration, $x_k$.
To simplify notation, accordingly, let the index k for any given sample, $\{x_k\}$,
be determined by the relations, $x_i < x_j$ for $i < j$, where $k = 1, \ldots, n$. Then,
by (2) as $k = r + 1$, we have

$$P(p' < p_k < p'') = \frac{n!}{(k-1)!\,(n-k)!} \int_{p'}^{p''} p^{k-1} q^{n-k}\, dp, \tag{3}$$

where $p_k$ is the probability that random sample values from U will be less than
the k-th value in order of ascending magnitude from a given random sample,
$\{x_k\}$, of n values from U; and $P(p' < p_k < p'')$ denotes the expectation that in
such sampling $p_k$ will lie in the interval, $(p', p'')$.

² Bayes, Philosophical Transactions, 53, 370 (1763). Cf. Todhunter, I., "A History of the
Mathematical Theory of Probability," Macmillan and Co., London, 1865.

³ Laplace, "Théorie Analytique des Probabilités," Paris, 1820; and other works. Cf.
Todhunter, l.c.

⁴ Pearson, K., Philosophical Magazine, Series 6, Vol. 13, 365, (1907).

In general, let $E(w) = \bar{w}$ denote the mathematical expectation of any
variable, w, under the given sampling conditions. Then, from a well-known
relation developed by Laplace, we obtain from (3) the mean expectation of $p_k$,

$$\bar{p}_k = \frac{k}{n+1}; \tag{4}$$

and, further relations⁴ of Karl Pearson yield

$$E\left((p_k - \bar{p}_k)^2\right) = \sigma^2_{p_k} = \frac{k(n - k + 1)}{(n+1)^2 (n+2)}; \tag{5}$$








i.e., the mean squared error in systematic use of $\frac{k}{n+1}$ instead of the unknown
$p_k$ should have the value in (5). Specific confidence ranges for x are readily
established; e.g., the expectation that in random draft from U we obtain x
within the range $(x_k, x_{n-k+1})$ in view of the sample, S, is

$$P(x_k < x < x_{n-k+1}) = 1 - \frac{2k}{n+1} \quad \text{for } 2k < n + 1; \tag{6}$$

and $P(x < x_k) = P(x > x_{n-k+1}) = \frac{k}{n+1}$. For a given variate, w, the range
$(\alpha, \beta)$ will be called central if $P(w < \alpha) = P(w > \beta)$, as in the case under (6).
This is in accord with the development of the subject of confidence ranges by
Neyman⁵,⁶ and by Clopper and E. S. Pearson⁷ following the introduction of the
notion of fiducial interval by R. A. Fisher.⁸,⁹ The estimates of $p_k$ in (4) may
be of value in studying frequency-distribution from the point of view developed
by Schmidt,¹ by comparison of $x_k$ with $y\left(\frac{k}{n+1}\right)$ rather than with y at another
argument [illegible], where
y is a univariant inverse of the integral of a given frequency function, taken to
replace the unknown $f(x)$. Obviously, $P(x_k < x < x_{k+1}) = \frac{1}{n+1}$. A discussion
of the special case, n = 2, has been prominent recently in a controversy
between Jeffreys¹⁰ and Fisher¹¹ and in an article by Bartlett.¹²

⁵ Neyman, J., J. Roy. Stat. Soc., 97, 589, (1934).
⁶ Neyman, J., Annals of Math. Stat., 6, No. 3, 111, (1935).
⁷ Clopper, C. J., and Pearson, E. S., Biometrika, 26, 404, (1934).
⁸ Fisher, R. A., Proc. Camb. Phil. Soc., 26, 528, (1930).
⁹ Fisher, R. A., Proc. Roy. Soc., A 139, 343, (1933).

Now, in (3) for $p = p'$, and $p'' = 1$; we may write¹³


$$P(p < p_k) = \sum_{\alpha=0}^{k-1} \binom{n}{\alpha}\, p^{\alpha} q^{n-\alpha} = I_q(n - k + 1, k) = \frac{B_q(n - k + 1, k)}{B(n - k + 1, k)}, \tag{7}$$

where $q = 1 - p$, and the incomplete B and I functions are those of K. Pearson¹³
and Müller.¹⁴ Now, let M be the unknown median of the infinite population,
U. Then, by definition of $p_k$, if and only if $x_k > M$, then $p_k > \frac{1}{2}$. Therefore,

$$P(M < x_k) = P(0.5 < p_k) = \frac{1}{2^n} \sum_{\alpha=0}^{k-1} \binom{n}{\alpha} = I_{0.5}(n - k + 1, k). \tag{8}$$

Obviously, $P(x_k < M < x_{k+1}) = \frac{1}{2^n} \binom{n}{k}$, and the expectation that M lie
between the k-th observations from each end of the set, S, is given by

$$P(x_k < M < x_{n-k+1}) = 1 - 2 \cdot I_{0.5}(n - k + 1, k), \quad \text{for } 2k < n + 1. \tag{9}$$

Obviously, this confidence range is central.
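
Relations (6) and (9) involve nothing beyond binomial coefficients, so the expectations can be obtained exactly. The sketch below is a minimal illustration with hypothetical function names; it uses the finite-sum form of $I_{0.5}(n - k + 1, k)$ given in (8) and exact rational arithmetic.

```python
from fractions import Fraction
from math import comb

def expectation_x_between(n, k):
    """Relation (6): expectation that a further draw x lies in (x_k, x_{n-k+1})."""
    if not 2 * k < n + 1:
        raise ValueError("requires 2k < n + 1")
    return Fraction(n - 2 * k + 1, n + 1)

def median_tail(n, k):
    """Relation (8): P(M < x_k) = I_0.5(n - k + 1, k), as a finite binomial sum."""
    return Fraction(sum(comb(n, alpha) for alpha in range(k)), 2 ** n)

def median_confidence(n, k):
    """Relation (9): expectation that the median lies in (x_k, x_{n-k+1})."""
    if not 2 * k < n + 1:
        raise ValueError("requires 2k < n + 1")
    return 1 - 2 * median_tail(n, k)

# Ten observations, range between the second smallest and second largest.
conf_median = median_confidence(10, 2)      # 1 - 2 * (1 + 10) / 1024
conf_draw = expectation_x_between(10, 2)    # 1 - 2 * 2 / 11
```

For ten observations the range between the second and ninth ordered values thus carries an expectation of 501/512 for the median, against 7/11 for a single further draw.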


Section 2 


Now, consider another infinite population, U′. In similar manner we may
develop expressions for confidence ranges and distribution expectations. Let $x'$
be the variate, and consider a sample, $S' = \{x'_m\}$, of n′ observations drawn
without replacements from A according to U′ but after the sample, S, of U;
i.e., so that no two of these sample values in S′ are equal, nor any of them equal
to a value in S. Furthermore, let m be the order of ascending magnitude of $x'$
values in S′; and $p'_m = P(x' < x'_m)$ for $x'$ drawn at random from U′, and let M′
be the unknown median of U′. Then, by replacement of $x$, $n$, $p_k$, $k$, and M by
$x'$, $n'$, $p'_m$, $m$, and M′, respectively, in relations already developed for U and S,
we obtain corresponding expressions for U′ and S′; e.g.,

$$P(x'_m < x' < x'_{m+1}) = \frac{1}{n' + 1}. \tag{10}$$

¹⁰ Jeffreys, H., Proc. Roy. Soc., A 138, 48, (1932); A 140, 523, (1933); A 146, 9, (1934);
Proc. Camb. Phil. Soc., 29, 83, (1933).

¹¹ Fisher, R. A., Proc. Roy. Soc., A 146, 1, (1934).

¹² Bartlett, M. S., Proc. Roy. Soc., A 141, 518, (1933).

¹³ Pearson, K., Biometrika, 16, 202, (1924).

¹⁴ Müller, J. H., Biometrika, 22, 284, (1930-31).

Now, let the index values, $k_m$, be defined as the number of values of $\{x_k\}$ that
are less than $x'_m$, $m = 1, \ldots, n'$. Then, for all realized cases,

$$x_{k_m} < x'_m < x_{k_m + 1}; \quad m = 1, \ldots, n', \tag{11}$$

with [condition illegible] for the extreme members of (11) in S. Then, for $x$ and $x'$ drawn at random
from U and U′, respectively, we may write

$$0 < (n + 1)(n' + 1) \cdot P(x < x') - \sum_{m=1}^{n'} k_m < n + n' + 1, \tag{12}$$
provided that the expectations for U and U’ may be treated as independent. 
Similarly, for $P(M < M')$ we have the relations,

$$\sum_{m=1}^{n'} \binom{n'}{m} I_{0.5}(n - k_m + 1, k_m) < 2^{n'} \cdot P(M < M') < 1 + \sum_{m=1}^{n'} \binom{n'}{m-1} I_{0.5}(n - k_m, k_m + 1). \tag{13}$$

Of course, $I_{0.5}(n + 1, 0) = 0$, and $I_{0.5}(0, n + 1) = 1$. It may be verified readily
that the inequality relations of (12) and (13) provide best upper and lower
bounds for $P(x < x')$ and $P(M < M')$ under the circumstances given.

Obviously, any increasing function, $\varphi(y)$, for y in A, may be used throughout
the arguments, with $\varphi(y)$ replacing $y = x, x_k, M, x', x'_m, M'$, respectively.
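
The bounds in (12) depend on the two samples only through the counts $k_m$. The following sketch, with a hypothetical function name and made-up sample values, computes them exactly; it assumes, as in the text, that no two observations are equal.

```python
from bisect import bisect_left
from fractions import Fraction

def bounds_p_x_less_than_x_prime(sample, sample_prime):
    """Lower and upper bounds of relation (12) for P(x < x')."""
    ordered = sorted(sample)
    n, n_prime = len(ordered), len(sample_prime)
    # k_m = number of values of S that are less than the m-th value of S'.
    k_sum = sum(bisect_left(ordered, value) for value in sample_prime)
    denom = (n + 1) * (n_prime + 1)
    return Fraction(k_sum, denom), Fraction(k_sum + n + n_prime + 1, denom)

# Hypothetical samples: S = {1, 2, 3}, S' = {2.5, 3.5}, so k_1 = 2 and k_2 = 3.
lower, upper = bounds_p_x_less_than_x_prime([1.0, 2.0, 3.0], [2.5, 3.5])
```

The width of the interval is always $(n + n' + 1)/((n + 1)(n' + 1))$, so that the bounds tighten only as both samples grow.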


Section 3 


Consider, now, the case of a finite population, $U_N$, of real numbers $\{x^{(i)}\}$,
$x^{(i)} < x^{(j)}$ for $i < j$, $i = 1, \ldots, N$. Assume that N is known, and that a
sample, S, of n values has been drawn at random from $U_N$ without replacements.
Let the sample values be $\{x_k\}$, $k = 1, \ldots, n$; and k be an arbitrarily determined
index. As before, we might consider k the order of draft, temporarily, but the
same analysis may be made if we let k be the order of ascending magnitude in
the sample, S, and disregard its value in connection with a priori estimates of
draft probability. Each $x_k = x^{(u_k)}$ for some unknown $u_k = 1, \ldots, N$; and,
a priori (i.e., with no knowledge as to order of magnitude of other values in the
sample), any two of these values are equally likely. Obviously, this is so if $x_k$
is the first value drawn from $U_N$, and the rest of the sample may be regarded
as a random draft without replacements of $n - 1$ elements from $[U_N - x_k]$.
Let r be the number of these sample values less than $x_k$, and $s = n - 1 - r$.
Then the probability of drawing such a sample after the given $x_k$, under the
conditions given, is $\binom{u_k - 1}{r} \binom{N - u_k}{s} \Big/ \binom{N - 1}{n - 1}$, where $u_k - 1$ is the unknown number of
values in $U_N$ that are less than $x_k$. To estimate the expectation, $P(R = u_k - 1)$,
that there are just a given number, R, of values in $U_N$ less than $x_k$; we encounter
the same situation considered by K. Pearson in a paper subsequent to those
applied to the infinite universe; and, by a simple conversion in notation, we have

$$P(R = u_k - 1) = \frac{\binom{R}{r} \binom{N - 1 - R}{s}}{\binom{N}{n}}. \tag{14}$$

In previous communications¹⁶,¹⁷ I have defined a function,

$$\psi(r, s, r', s') = \frac{\sum_{\alpha=0}^{r'} \binom{r + r' - \alpha}{r} \binom{s + s' + 1 + \alpha}{s}}{\binom{r + s + r' + s' + 2}{r + s + 1}}, \tag{15}$$

for any four rational integers $r, s, r', s' \geq 0$; and shown that Pearson's further
result, equivalent here to evaluation of $P(u_k \leq R + 1)$ for a given R, may be
expressed by means of this ψ-function. Thus, we have

$$P(u_k \leq R + 1) = \psi(r, s, R - r, N - R - s - 2). \tag{16}$$

It was demonstrated also¹⁶,¹⁷ that

$$\psi(r, s, r', s') = \psi(r, r', s, s') = \psi(s', r', s, r) = 1 - \psi(s, r, s', r'), \tag{17}$$

with extension of the definition to include $\psi(r, s, -1, s') = 0$, and that

$$\psi(r, s, r', s') = \frac{\sum_{\alpha=0}^{s} \binom{r + r' + 1}{r + 1 + \alpha} \binom{s + s' + 1}{s - \alpha}}{\binom{r + s + r' + s' + 2}{r + s + 1}}. \tag{18}$$


¹⁵ Pearson, K., Biometrika, 20 A, 149, (1928).
¹⁶ Thompson, W. R., Biometrika, 25, 285, (1933).
¹⁷ Thompson, W. R., American Journal of Mathematics, 57, 450, (1935).

As in the case of the infinite population, here also it is obvious that the order
of draft of $x_k$ is of no consequence in the analysis; and again we will let $k = r + 1$,
whence $s = n - k$, and we may make these substitutions in (14) and (16).
Then, we may write

$$P(u_k \leq R) = \psi(k - 1, n - k, R - k, k + N - R - n - 1); \tag{19}$$

and, obviously, $P(u_{n-k+1} \geq N - R + 1) = P(u_k \leq R)$. Hence, if we let M
be the unknown median of $U_N$; and $m = \frac{N - a}{2}$ where a = 0, 1, and $N - a$ is
even; then, as $u_k$ is an integer,

$$P(x_k \leq M \leq x_{n-k+1}) = P\left(u_k \leq \frac{N + 1}{2} \leq u_{n-k+1}\right) = 1 - 2 \cdot \psi(k - 1, n - k, m - k, k + N - m - n - 1), \tag{20}$$

which is the expectation that the median of $U_N$ lie within the closed interval,
$(x_k, x_{n-k+1})$, for $2k \leq n + 1$. This gives the confidence range, analogous to
that for the infinite universe. It may be noted that


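$$P(u_k \leq R < u_{k+1}) = P(u_k \leq R) - P(u_{k+1} \leq R) = \psi(r, s, r', s') - \psi(r + 1, s - 1, r' - 1, s' + 1)$$

where $r = k - 1$, $s = n - k$, $r' = R - k$, and $s' = k + N - R - n - 1$.
Hence, (18) gives

$$P(u_k \leq R < u_{k+1}) = \frac{\binom{R}{k} \binom{N - R}{n - k}}{\binom{N}{n}}. \tag{21}$$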


The approach by way of Pearson's problem again makes it easy to evaluate
the expected mean $\bar{p}_k$ and variance as in the case of the infinite population,
where $p_k = P(x < x_k)$ for x drawn at random from $U_N$. Of course, $p_k = \frac{u_k - 1}{N}$;
but $u_k$ is unknown. From Pearson's result,¹⁵ however, we obtain

$$\bar{p}_k = \frac{k(N + 1) - n - 1}{N(n + 1)} = \frac{k}{n + 1}\left(1 - \frac{n}{N}\right) + \frac{k - 1}{N}, \tag{22}$$

and the expected variance of $p_k$,

$$\sigma^2_{p_k} = E\left((p_k - \bar{p}_k)^2\right) = \frac{k(n - k + 1)(N + 1)(N - n)}{(n + 1)^2 (n + 2)\, N^2}. \tag{23}$$
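
The finite-population relations lend themselves to an exact check by enumeration. The sketch below is illustrative, with hypothetical function names: it tabulates the expectation (14) for every R, evaluates the ψ-function in the form (18), and compares the mean and variance of $p_k = (u_k - 1)/N$ obtained by direct summation with the closed forms (22) and (23) as reconstructed above.

```python
from fractions import Fraction
from math import comb

def rank_distribution(N, n, k):
    """Relation (14): expectation that u_k - 1 = R, for R = 0, ..., N - 1."""
    r, s = k - 1, n - k
    return [Fraction(comb(R, r) * comb(N - 1 - R, s), comb(N, n)) for R in range(N)]

def psi(r, s, r_prime, s_prime):
    """Psi-function in the form (18); defined as zero when r' = -1."""
    if r_prime < 0:
        return Fraction(0)
    total = sum(comb(r + r_prime + 1, r + 1 + alpha) * comb(s + s_prime + 1, s - alpha)
                for alpha in range(s + 1))
    return Fraction(total, comb(r + s + r_prime + s_prime + 2, r + s + 1))

def prob_rank_at_most(N, n, k, R):
    """Relation (19): P(u_k <= R), valid for k - 1 <= R <= N - n + k - 1."""
    return psi(k - 1, n - k, R - k, k + N - R - n - 1)

def expected_p(N, n, k):
    """Relation (22): expected value of p_k."""
    return Fraction(k * (N + 1) - n - 1, N * (n + 1))

def variance_p(N, n, k):
    """Relation (23): expected variance of p_k."""
    return Fraction(k * (n - k + 1) * (N + 1) * (N - n), (n + 1) ** 2 * (n + 2) * N ** 2)

# Hypothetical case: population of 10, sample of 4, second smallest sample value.
N, n, k = 10, 4, 2
dist = rank_distribution(N, n, k)
mean_by_sum = sum(Fraction(R, N) * pr for R, pr in enumerate(dist))
var_by_sum = sum((Fraction(R, N) - mean_by_sum) ** 2 * pr for R, pr in enumerate(dist))
```

Exact fractions are used so that agreement with (22) and (23) is an identity rather than a rounding coincidence; the same enumeration can be repeated for other small N, n and k.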
THE SAMPLING DISTRIBUTION OF THE COEFFICIENT OF
VARIATION


By WALTER A. HENDRICKS WITH THE ASSISTANCE OF KATE W. ROBEY


National Agricultural Research Center, Beltsville, Maryland 


The coefficient of variation does not appear to be of very great interest to 
statisticians in general. However, its use in biometry is sufficiently extensive 
for some knowledge of its sampling distribution to be desirable. The present 
paper is an attempt to satisfy this need. 

For the purposes of the following discussion, the coefficient of variation may be defined as the ratio of the standard deviation of a number of measurements to the arithmetic mean:

\[ v = \frac{s}{\bar x}. \tag{1} \]

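As is well known, the probability that the mean of a sample of $n$ measurements, taken at random from a normal universe, lies between $\bar x$ and $\bar x + d\bar x$ and that the standard deviation of the measurements in the same sample lies between $s$ and $s + ds$ is given by the relation:

\[ dF_{\bar x, s} = \frac{n^{\frac{n}{2}}}{2^{\frac{n}{2}-1}\sqrt{\pi}\;\Gamma\!\left(\frac{n-1}{2}\right)\sigma^{n}}\; e^{-\frac{n}{2\sigma^2}\left[(\bar x - m)^2 + s^2\right]}\, s^{n-2}\, d\bar x\, ds. \tag{2} \]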


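If equation (2) is expressed in terms of polar coördinates by means of the transformation: $\bar x = \rho\cos\theta$; $s = \rho\sin\theta$, it becomes a distribution function of $\rho$ and $\theta$ in which $\theta = \arctan v$:

\[ dF_{\rho,\theta} = \frac{n^{\frac{n}{2}}}{2^{\frac{n}{2}-1}\sqrt{\pi}\;\Gamma\!\left(\frac{n-1}{2}\right)\sigma^{n}}\; e^{-\frac{n}{2\sigma^2}\left(\rho^2 - 2m\rho\cos\theta + m^2\right)}\, \rho^{n-1}\sin^{n-2}\theta\; d\rho\, d\theta. \tag{3} \]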


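In equation (3), $\rho$ may vary from 0 to $\infty$ and $\theta$ may vary from 0 to $\pi$. To find the distribution function of $\theta$, all that is necessary is to write:

\[ dF_\theta = k\left[\int_0^\infty e^{-(a\rho - b)^2}\rho^{n-1}\,d\rho\right] d\theta \tag{4} \]

in which,

\[ k = \frac{n^{\frac{n}{2}}}{2^{\frac{n}{2}-1}\sqrt{\pi}\;\Gamma\!\left(\frac{n-1}{2}\right)\sigma^{n}}\; e^{-\frac{n m^2 \sin^2\theta}{2\sigma^2}}\sin^{n-2}\theta, \qquad a = \frac{\sqrt{n}}{\sqrt{2}\,\sigma}, \quad\text{and}\quad b = \frac{\sqrt{n}}{\sqrt{2}\,\sigma}\, m\cos\theta, \]

and to perform the indicated integration.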
To evaluate the integral inside the brackets in equation (4), we may write:

\[ \int_0^\infty e^{-(a\rho - b)^2}\rho^{n-1}\,d\rho = \frac{1}{a^{n}}\sum_{i=0}^{n-1}\frac{(n-1)!}{(n-1-i)!\;i!}\; b^{i}\int_{-b}^{\infty} e^{-u^2}u^{n-1-i}\,du. \tag{5} \]

Consider the integral, $\int_{-b}^{\infty} e^{-u^2}u^{n-1-i}\,du$. If $b$ is sufficiently large, as is the case when the parameters of equation (2) are of such magnitude that practically the entire volume under the frequency surface lies to the right of the $s$ axis, that is to say, if negative and small positive values of $\bar x$ occur so infrequently that their effects may be neglected, the lower limit, $-b$, of this integral may be replaced by $-\infty$ without introducing any appreciable error. The value of the integral, $\int_{-\infty}^{\infty} e^{-u^2}u^{n-1-i}\,du$, is zero when $n - 1 - i$ is odd and $\Gamma\!\left(\frac{n-i}{2}\right)$ when $n - 1 - i$ is even, zero being counted as an even number.


Subject to the above condition that $b$ be sufficiently large, we may, therefore, write equation (5) in the form:

\[ \int_0^\infty e^{-(a\rho - b)^2}\rho^{n-1}\,d\rho = \frac{1}{a^{n}}\,{\sum_{i=0}^{n-1}}'\;\frac{(n-1)!}{(n-1-i)!\;i!}\; b^{i}\;\Gamma\!\left(\frac{n-i}{2}\right) \tag{6} \]

in which the symbol, $\sum'$, indicates that the only terms entering into the summation are those in which $n - 1 - i$ is an even number.


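Substituting this expression for the integral inside the brackets in equation (4), replacing $k$, $a$, and $b$ by the quantities which they represent, and writing $V$ in place of the ratio, $\frac{\sigma}{m}$, we obtain the following distribution function of $\theta$:

\[ dF_\theta = \frac{2}{\sqrt{\pi}\;\Gamma\!\left(\frac{n-1}{2}\right)}\; e^{-\frac{n\sin^2\theta}{2V^2}}\sin^{n-2}\theta\;{\sum_{i=0}^{n-1}}'\;\frac{(n-1)!\;\Gamma\!\left(\frac{n-i}{2}\right)}{(n-1-i)!\;i!}\;\frac{n^{\frac{i}{2}}}{2^{\frac{i}{2}}V^{i}}\cos^{i}\theta\; d\theta. \tag{7} \]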


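Equation (7) may be written in terms of $v$, if desired, by making the substitution, $\theta = \arctan v$:

\[ dF_v = \frac{2}{\sqrt{\pi}\;\Gamma\!\left(\frac{n-1}{2}\right)}\; e^{-\frac{n}{2V^2}\cdot\frac{v^2}{1+v^2}}\;\frac{v^{n-2}}{(1+v^2)^{\frac{n}{2}}}\;{\sum_{i=0}^{n-1}}'\;\frac{(n-1)!\;\Gamma\!\left(\frac{n-i}{2}\right)}{(n-1-i)!\;i!}\;\frac{n^{\frac{i}{2}}}{2^{\frac{i}{2}}V^{i}(1+v^2)^{\frac{i}{2}}}\; dv. \tag{8} \]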


It must be emphasized that equation (8) has been derived on the hypothesis 
that negative and small positive values of $\bar x$ occur so infrequently that they may
be neglected. However, since this condition is satisfied in the vast majority 
of practical problems in which the coefficient of variation is likely to be used, 
the limitation is not of much practical importance. 
FIG. 1. OBSERVED AND THEORETICAL DISTRIBUTIONS OF VALUES OF v FOR 512 SAMPLES
OF NUMBERS OF HEADS APPEARING IN TWO SUCCESSIVE TOSSES OF TEN COINS


As a test of the validity of equation (8), the authors calculated 512 coefficients 
of variation of the numbers of heads appearing in two successive tosses of ten 
coins. The coins were tossed 1024 times, thus yielding 512 samples, each con- 
sisting of two observations. For these data we have m = 5, o = 1.581, and 
V = 0.3162. 

For the case, $n = 2$, equation (8) reduces to:

\[ dF_v = \frac{2}{\sqrt{\pi}\,V}\;\frac{1}{(1+v^2)^{\frac{3}{2}}}\; e^{-\frac{v^2}{V^2(1+v^2)}}\; dv. \tag{9} \]

Figure 1 shows the distribution of the 512 values of $v$ obtained from the coin tossing experiment, together with the theoretical distribution given by equation (9).
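As an illustration only, the frequency function for general $n$ and its special case for $n = 2$ may be evaluated numerically as sketched below. The sketch assumes that equation (8) has the form $dF_v = \frac{2}{\sqrt{\pi}\,\Gamma((n-1)/2)}\, e^{-n v^2 / (2V^2(1+v^2))}\, \frac{v^{n-2}}{(1+v^2)^{n/2}} \sum' \binom{n-1}{i}\Gamma\!\left(\frac{n-i}{2}\right) \left(\frac{n}{2V^2(1+v^2)}\right)^{i/2} dv$, the sum running over those $i$ for which $n - 1 - i$ is even, and that equation (9) is its reduction for $n = 2$. The function names are hypothetical.

```python
import math


def cv_density(v, n, V):
    """Approximate density of the coefficient of variation v, assumed form of (8)."""
    total = 0.0
    for i in range(n):
        if (n - 1 - i) % 2:          # only terms with n - 1 - i even contribute
            continue
        total += (math.comb(n - 1, i) * math.gamma((n - i) / 2)
                  * (n / (2 * V * V * (1 + v * v))) ** (i / 2))
    lead = 2 / (math.sqrt(math.pi) * math.gamma((n - 1) / 2))
    kernel = math.exp(-n * v * v / (2 * V * V * (1 + v * v)))
    return lead * kernel * v ** (n - 2) / (1 + v * v) ** (n / 2) * total


def cv_density_n2(v, V):
    """Special case n = 2, assumed form of (9)."""
    return (2 / (math.sqrt(math.pi) * V) * (1 + v * v) ** -1.5
            * math.exp(-v * v / (V * V * (1 + v * v))))
```

With the substitution $t = v/\sqrt{1+v^2}$ the area under the $n = 2$ curve is $\operatorname{erf}(1/V)$ rather than exactly unity; for $V = 0.3162$ the shortfall is negligible, which reflects the condition that negative and small positive means be rare.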


An inspection of Figure 1 indicates that the agreement between the observed and theoretical frequencies is fairly good. An application of the familiar chi-square test for goodness of fit showed the agreement to be rather poor. According to
this test, the degree of discrepancy between theory and observation could have 
arisen by chance less than once in a hundred trials. However, the discrepancies 
may be partly due to the fact that data distributed in a discrete fashion were 
treated by methods appropriate to the analysis of data distributed according 
to a continuous frequency curve. 

As another test of the validity of equation (8), the authors calculated 149 coefficients of variation of “days to maturity,” which is the length of time elapsing between the date of hatch of a chicken and the time egg production commences, for samples of two observations made upon Rhode Island Red pullets. Figure 2 shows the observed distribution of the 149 coefficients of variation, together with the theoretical distribution given by equation (9).

[Figure 2: frequency plotted against value of v]

FIG. 2. OBSERVED AND THEORETICAL DISTRIBUTIONS OF VALUES OF v FOR 149 SAMPLES
OF “DAYS TO MATURITY” IN RHODE ISLAND RED PULLETS FOR SAMPLES OF TWO
OBSERVATIONS

In applying equation (9) to these data, the parameter, $V$, had to be evaluated from the data. The best estimates of the values of $m$, $\sigma$, and $V$ which could be obtained from the 298 measurements of “days to maturity” are $m = 210.477$, $\sigma = 18.6991$, $V = 0.0888415$. The theoretical distribution shown in Figure 2 is based on this value of $V$.

The agreement between theory and observation shown by Figure 2 is very good. In this case, the chi-square test showed that the degree of discrepancy encountered could have arisen by chance about six times in ten trials.



















SOME NOTES ON EXPONENTIAL ANALYSIS 


By H. R. GRUMMANN


Assistant Professor, Department of Applied Mathematics, Washington University 





M. E. J. Gheury de Bray in his charming little book “Exponentials made Easy”¹ tells how to determine the constants in the equation,

\[ y = A_1 e^{a_1 x} + A_2 e^{a_2 x} \tag{I} \]

so that the curve will pass through four points, with equidistant ordinates on an empirical curve. If (Fig. 1) $y_0$, $y_1$, $y_2$, and $y_3$ are the equidistant ordinates and $\delta$ is their common separation, $y_0$ being the $y$ intercept of the curve, de Bray’s formulas are:

\[ a_1 = \frac{\log z_1}{\delta}, \qquad a_2 = \frac{\log z_2}{\delta}, \tag{II} \]

where $z_1$ and $z_2$ are the roots of the quadratic equation

\[ \begin{vmatrix} z^2 & z & 1 \\ y_3 & y_2 & y_1 \\ y_2 & y_1 & y_0 \end{vmatrix} = 0. \tag{III} \]

The coefficients $A_1$ and $A_2$ of the two exponential terms are obtained by solving the two simultaneous equations

\[ \begin{aligned} A_1 + A_2 &= y_0 \\ A_1 z_1 + A_2 z_2 &= y_1 \end{aligned} \tag{IV} \]
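The procedure of equations II, III and IV may be sketched in a few lines. The sketch assumes that the quadratic III is written $z^2 + pz + q = 0$, with $p$ and $q$ fixed by the two conditions $y_2 + p\,y_1 + q\,y_0 = 0$ and $y_3 + p\,y_2 + q\,y_1 = 0$ that follow from $y_j = A_1 z_1^j + A_2 z_2^j$. Complex arithmetic is used so that complex or negative roots are returned rather than rejected. The function name is hypothetical.

```python
import cmath


def de_bray_fit(y0, y1, y2, y3, delta):
    """Constants of y = A1*exp(a1*x) + A2*exp(a2*x) through four ordinates
    spaced delta apart, the first ordinate lying at x = 0."""
    # Coefficients of z^2 + p*z + q = 0 from the two linear conditions.
    det = y1 * y1 - y0 * y2
    p = (y0 * y3 - y1 * y2) / det
    q = (y2 * y2 - y1 * y3) / det
    root = cmath.sqrt(p * p - 4 * q)
    z1, z2 = (-p + root) / 2, (-p - root) / 2
    # Equation II: exponents from the logarithms of the roots.
    a1, a2 = cmath.log(z1) / delta, cmath.log(z2) / delta
    # Equation IV: A1 + A2 = y0 and A1*z1 + A2*z2 = y1.
    A1 = (y1 - y0 * z2) / (z1 - z2)
    A2 = y0 - A1
    return A1, a1, A2, a2
```

When the four ordinates really come from two real exponentials, the imaginary parts of the returned constants vanish to within rounding; a non-zero imaginary part signals one of the awkward cases discussed in the text.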


In attempting to find suitable empirical equations for some “river rating 
curves”—graphs of discharge versus stage—the writer tried to make use of 
de Bray’s procedure. The original intention was to use the above method to 
determine the constants, and then to correct these constants by the use of 
Least Squares, as done by J. W. T. Walsh? in an application of the method to 
a problem in radioactivity. It often happens that a series of plotted obser- 
vations suggest a simple exponential function, but that when the observations 
are replotted on semi-logarithmic paper a straight line is not obtained. Often, 
as in the case of a good many river rating curves, the result may be described as “almost straight.” At first blush it might seem that in all such cases it ought to be possible to fit a curve with equation I to the data by de Bray’s method. By an easy generalization of the above formulas, the constants in an equation with three or four exponential terms could be determined if two terms were not enough to secure a good fit.

1 Macmillan & Co. Ltd., St. Martin’s St., London W. C. 2.
2 Proceedings Phys. Soc. London XXXII. This reference is given by de Bray in his book, “Exponentials made Easy.”

It was soon found, however, that innocent looking monotonic curves without 
points of inflection plotted from data that gave an “almost straight” line on 
semi-logarithmic paper quite often led to a quadratic equation, (equation III) 
whose roots were not both positive numbers. 


FIG. 1


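If $z_1$ and $z_2$, the roots of III, are complex conjugates, it may be seen from IV that $A_1$ and $A_2$ will be complex conjugates. Also, $a_1$ and $a_2$ will be conjugate complex numbers and may be calculated as follows:

Let $z_1 = re^{i\theta}$ and $z_2 = re^{-i\theta}$, then from equation II,

\[ re^{i\theta} = e^{a_1\delta}, \qquad re^{-i\theta} = e^{a_2\delta}, \]

whence, by division to eliminate $r$ we have

\[ e^{2i\theta} = e^{\delta(a_1 - a_2)}, \quad\text{or}\quad a_1 - a_2 = \frac{2i\theta}{\delta}. \tag{Va} \]

Also, by multiplication to eliminate $\theta$,

\[ r^2 = e^{\delta(a_1 + a_2)}, \quad\text{or}\quad a_1 + a_2 = \frac{2\log r}{\delta}. \tag{Vb} \]

The sum and difference of the two $a$’s being obtained by these expressions, one may solve for $a_1$ and $a_2$.

Let $a_1 = \lambda + i\mu$, $A_1 = \alpha + i\beta$, $a_2 = \lambda - i\mu$, $A_2 = \alpha - i\beta$. Then equation I becomes

\[ y = (\alpha + i\beta)e^{(\lambda + i\mu)x} + (\alpha - i\beta)e^{(\lambda - i\mu)x} = 2e^{\lambda x}\left[\alpha\cos\mu x - \beta\sin\mu x\right], \]

or

\[ y = 2e^{\lambda x}R\cos(\mu x + \epsilon) \tag{VI} \]

where $R = \sqrt{\alpha^2 + \beta^2}$ and $\tan\epsilon = \frac{\beta}{\alpha}$.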

If one of the roots of III is negative, the de Bray formulas II and IV will still give an expression for equation I which formally reproduces $y_0$, $y_1$, $y_2$, and $y_3$ when $0$, $\delta$, $2\delta$, and $3\delta$ are substituted for $x$ respectively, but which is useless for interpolating and of no value as a solution of the curve fitting problem. Suppose, for example, that $z_1$ is positive and $z_2$ is negative. Then

\[ z_2 = (-1)\,|z_2| \quad\text{and}\quad \log z_2 = \log(-1) + \log|z_2|. \]

Equation I then becomes

\[ y = A_1 e^{a_1 x} + (-1)^{\frac{x}{\delta}} A_2\, e^{\frac{x\log|z_2|}{\delta}}, \]

the factor $(-1)^{\frac{x}{\delta}}$ being real only when $x$ is an integral multiple of $\delta$. If the $(-1)$ is written $e^{i\pi}$, we have

\[ y = A_1 e^{a_1 x} + e^{\frac{i\pi x}{\delta}} A_2\, e^{\frac{x\log|z_2|}{\delta}}, \quad\text{or}\quad y = A_1 e^{a_1 x} + A_2\, e^{\frac{x\log|z_2|}{\delta}}\left[\cos\frac{\pi x}{\delta} + i\sin\frac{\pi x}{\delta}\right]. \]

Neither the real nor the imaginary part would be a graduation function for a monotonic curve as each has a half period of $\delta$.

The expression for I is similar, and of no greater practical value, if both of the roots of III are negative.
Without loss of generality we may let $y_0 = 1$, $r_1 = \frac{y_1}{y_0}$, $r_2 = \frac{y_2}{y_1}$, $r_3 = \frac{y_3}{y_2}$. Then the quadratic III becomes, written in the form

\[ z^2 + pz + q = 0, \]

i.e.,

\[ z^2 + \frac{r_2(r_1 - r_3)}{(r_2 - r_1)}\,z + \frac{r_1 r_2(r_3 - r_2)}{(r_2 - r_1)} = 0. \tag{IIIa} \]
Hence the roots of this quadratic are real and unequal if $D > 6$, equal if $D = 6$, and complex if $D < 6$, where

\[ D = \frac{r_3}{r_1} - 3\,\frac{r_1}{r_3} + 4\left[\frac{r_1}{r_2} + \frac{r_2}{r_3}\right]. \]

From the point of view of the computer, however, it is about as much work to calculate $D$ as to solve the quadratic equation.
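The relation between the criterion $D$ and the discriminant of IIIa may be illustrated numerically. The printed expression for $D$ is damaged in the source, so the sketch assumes one form that yields the stated threshold of 6, namely $D = r_3/r_1 - 3r_1/r_3 + 4(r_1/r_2 + r_2/r_3)$, obtained by dividing the numerator of $p^2 - 4q$ by $r_1 r_2^2 r_3$. For positive $r$’s this gives $p^2 - 4q = r_1 r_2^2 r_3\,(D - 6)/(r_2 - r_1)^2$, so the two quantities change sign together. The function names are hypothetical.

```python
def quadratic_coefficients(r1, r2, r3):
    """p and q of z^2 + p*z + q = 0 in terms of the ratios r1, r2, r3 (IIIa)."""
    p = r2 * (r1 - r3) / (r2 - r1)
    q = r1 * r2 * (r3 - r2) / (r2 - r1)
    return p, q


def criterion_d(r1, r2, r3):
    """Assumed form of D; roots are real and unequal when D > 6 (positive r's)."""
    return r3 / r1 - 3 * r1 / r3 + 4 * (r1 / r2 + r2 / r3)
```

For example, the ratios 1, 2, 3 give $p = -4$, $q = 2$ and a positive discriminant, in agreement with a value of $D$ above 6.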


[Figure 2: the plane of the point (q, p), with regions marked “two negative roots” and “complex”]

FIG. 2


Reverting to equation IIIa; suppose the numbers $q$ and $p$ are plotted as the coordinates of a point $(q, p)$ as in Fig. 2. Then the parabola $p^2 = 4q$ is, so to speak, a locus of equal roots. The remainder of the figure requires no explanation.

Suppose that all the $r$’s are positive, as they would be in the case of a simple monotonic curve which one proposed subjecting to an exponential analysis. If $q < 0$, the quadratic will have one negative root. Now

\[ q = \frac{r_1 r_2(r_3 - r_2)}{(r_2 - r_1)} \]

and hence for $q < 0$, if $r_2 > r_1$, then $r_3 < r_2$ and consequently $r_3 < r_2 > r_1$; and if $r_2 < r_1$, then $r_3 > r_2$, or $r_1 > r_2 < r_3$. Also, provided $p^2 > 4q$, a positive $p$ and a positive $q$ will give two negative roots. But

\[ p = \frac{r_2(r_1 - r_3)}{(r_2 - r_1)}, \]

and $p$ and $q$ can not both be positive when all the $r$’s are positive as this implies either that $r_2 > r_1$, $r_1 > r_3$ and $r_3 > r_2$, a contradiction, or else that $r_2 < r_1$, $r_1 < r_3$ and $r_3 < r_2$, also a contradiction. Hence if both roots are negative, the $r$’s can not be all positive. The case of two negative roots will not arise in trying to fit equation I to a monotonic curve, since if all the $r$’s are positive both $p$ and $q$ can not be positive.

For all $r$’s positive, provided $p^2 > 4q$, a positive $q$ and a negative $p$ will give two positive roots. But

\[ q = \frac{r_1 r_2(r_3 - r_2)}{(r_2 - r_1)} > 0 \quad\text{and}\quad -p = \frac{r_2(r_3 - r_1)}{(r_2 - r_1)} > 0 \]

mean that $r_3 > r_2 > r_1$ or $r_3 < r_2 < r_1$.

To sum up: If all the $r$’s are positive, de Bray’s method of exponential analysis is possible (a) when $D < 6$ and the roots of III are complex; (b) when $D > 6$ and $r_1 > r_2 > r_3$ or when $r_1 < r_2 < r_3$.

Figure 3 gives a picture of the second condition (b) of the preceding paragraph. Suppose an exponential curve is passed through the first two points on the empirical curve with ordinates $y_0$ and $y_1$. Its equation will be:

\[ y = y_0\left(\frac{y_1}{y_0}\right)^{\frac{x}{\delta}} = y_0\, r_1^{\frac{x}{\delta}}. \]

Suppose also that $y_2$ is less than the ordinate to this curve when $x = 2\delta$. Now pass an exponential curve through $y_1$ and $y_2$ using a new axis of ordinates coinciding with $y_1$. Its equation is

\[ y = y_1\, r_2^{\frac{x}{\delta}}, \]

or referred to the original axis:

\[ y = y_1\, r_2^{\frac{x - \delta}{\delta}}. \]

Now if the graduation is possible without using trigonometric functions, $y_3$ must be less than the ordinate of this second curve when $x = 3\delta$.

[Figure 3: the empirical curve]


It is natural to inquire if the state of affairs is not similar to this, for the cases of fitting curves with equations similar to I but having three or four exponential terms on the right hand side instead of only two. If three terms are used (see Fig. 4) to find constants in

\[ y = A_1 e^{a_1 x} + A_2 e^{a_2 x} + A_3 e^{a_3 x} \tag{Ia} \]

it is first necessary to find the roots of the cubic

\[ f(x) = \begin{vmatrix} x^3 & x^2 & x & 1 \\ y_5 & y_4 & y_3 & y_2 \\ y_4 & y_3 & y_2 & y_1 \\ y_3 & y_2 & y_1 & y_0 \end{vmatrix} = 0. \tag{IIIa} \]

Now, $f(x)$ will have no negative roots if $f(-x)$ has no changes of sign. But writing the conditions that the cofactors of the elements of the first row in the above determinant have the same signs, and assuming that all the $y$’s are positive, one does not get a series of conditions analogous to $r_3 > r_2 > r_1$ or $r_3 < r_2 < r_1$.

In the following, formulas will be derived for finding the constants in equation Ia after the roots of IIIa have been determined. Also formulas will be obtained for finding the constants in

\[ y = A_1 e^{a_1 x} + A_2 e^{a_2 x} + A_3 e^{a_3 x} + A_4 e^{a_4 x} \tag{Ib} \]

after the roots of

\[ \begin{vmatrix} z^4 & z^3 & z^2 & z & 1 \\ y_7 & y_6 & y_5 & y_4 & y_3 \\ y_6 & y_5 & y_4 & y_3 & y_2 \\ y_5 & y_4 & y_3 & y_2 & y_1 \\ y_4 & y_3 & y_2 & y_1 & y_0 \end{vmatrix} = 0 \tag{IIIb} \]

have been found. Both sets of formulas have been tested by an “exponential analysis” of the same body of data, viz., the very accurate recent determinations by the U. S. Bureau of Standards of the saturation pressure of water vapor above 100°C.³
For the case of three exponential terms in the graduation function, the $a$’s are found by formulas like II or V, after the roots of the cubic are found. If $z_1$, $z_2$, $z_3$ are the roots, the $A$’s are obtained by solving the simultaneous equations

\[ \begin{aligned} A_1 + A_2 + A_3 &= y_0 \\ A_1 z_1 + A_2 z_2 + A_3 z_3 &= y_1 \\ A_1 z_1^2 + A_2 z_2^2 + A_3 z_3^2 &= y_2 \end{aligned} \tag{IVa} \]

3 Osborne, Stimson, Fiock, and Ginnings: The Pressure of Saturated Water Vapor in the Range 100° to 374°C. Bureau Standards Journal of Research, Vol. 10, Febr. 1933, page 178.
This presents no new difficulty unless two of the roots are conjugate complex numbers. In this event, if we let $z_1$ = the real positive root, $z_2 = re^{i\theta}$, and $z_3 = re^{-i\theta}$, the determinant $D$ of the equations IVa may be written

\[ D = \begin{vmatrix} 1 & 1 & 1 \\ z_1 & re^{i\theta} & re^{-i\theta} \\ z_1^2 & r^2e^{2i\theta} & r^2e^{-2i\theta} \end{vmatrix} \]

or, expanded in terms of the elements of the first column and their minors,

\[ D = 2i\left[z_1 r^2\sin 2\theta - (r^3 + z_1^2 r)\sin\theta\right], \]

a pure imaginary. Similarly,

\[ A_1 D = 2i\left[r^2 y_1\sin 2\theta - (y_0 r^3 + y_2 r)\sin\theta\right], \]

also a pure imaginary, so that $A_1$ is real. Having calculated $A_1$ it is substituted in the first two of equations IVa, which are then solved for $A_2$ and $A_3$. $a_2$ and $a_3$ are then determined by formulas Va and Vb, replacing the subscripts 1 and 2 in those formulas by the subscripts 2 and 3 respectively. Finally the two exponential terms corresponding to the complex roots of the cubic are combined into a single trigonometric term as in equation VI.

The necessary formulas for the case of four exponential terms in the graduation function will be discussed briefly. The equations

\[ \begin{aligned} A_1 + A_2 + A_3 + A_4 &= y_0 \\ A_1 z_1 + A_2 z_2 + A_3 z_3 + A_4 z_4 &= y_1 \\ A_1 z_1^2 + A_2 z_2^2 + A_3 z_3^2 + A_4 z_4^2 &= y_2 \\ A_1 z_1^3 + A_2 z_2^3 + A_3 z_3^3 + A_4 z_4^3 &= y_3 \end{aligned} \tag{IVb} \]

have to be solved for the $A$’s. The $z$’s are the roots of IIIb. Two cases will be considered: First case: $z_1$ and $z_2$ are complex conjugates and $z_3$ and $z_4$ are complex conjugates. Second case: $z_1$ and $z_2$ are complex conjugates and $z_3$ and $z_4$ are real and positive. In either event $A_1$ and $A_2$ are complex conjugates, as will be proved below. Formulas for $A_1$ are given for both cases. Then $A_2$ is known since it is the conjugate of $A_1$. Having found $A_1$ and $A_2$, let

\[ \begin{aligned} c_0 &= y_0 - (A_1 + A_2) \\ c_1 &= y_1 - (A_1 z_1 + A_2 z_2) \end{aligned} \]

Both $c_0$ and $c_1$ are then real. To get $A_3$ and $A_4$ solve the equations:

\[ \begin{aligned} A_3 + A_4 &= c_0 \\ A_3 z_3 + A_4 z_4 &= c_1 \end{aligned} \]

A pair of exponential terms with conjugate complex coefficients will then be expressed as a single real trigonometric term as in VI.
The determinant of equations IVb may be written

\[ D = (z_1 - z_2)(z_1 - z_3)(z_1 - z_4)(z_2 - z_3)(z_2 - z_4)(z_3 - z_4). \tag{VII} \]

First case: Let $z_1 = a + ib$, $z_2 = a - ib$, $z_3 = \alpha + i\beta$, $z_4 = \alpha - i\beta$. Then $D$ may be written

\[ D = -4\beta b\left[(a - \alpha)^2 + (b - \beta)^2\right]\left[(a - \alpha)^2 + (b + \beta)^2\right], \tag{VIIa} \]

which is real. Now

\[ A_1 D + A_2 D = (z_1 - z_2)\begin{vmatrix} 0 & y_0 & 1 & 1 \\ 1 & y_1 & z_3 & z_4 \\ (z_1 + z_2) & y_2 & z_3^2 & z_4^2 \\ (z_1^2 + z_1 z_2 + z_2^2) & y_3 & z_3^3 & z_4^3 \end{vmatrix} \]

and this is real since $(z_1 - z_2)$ is a pure imaginary and the minors of the real elements of the first column of the determinant are all pure imaginaries. Hence $A_1$ and $A_2$ are complex conjugates since when each is expressed as a quotient of two determinants by Cramer’s rule, the sum of the two numerators is real and the common denominator is also real.


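For purposes of numerical calculation $A_1$ may be obtained from

\[ A_1 = \frac{NP}{D} \]

in which $D$ is obtained from VIIa,

\[ N = y_3 - (z_2 + z_3 + z_4)\,y_2 + (z_2 z_3 + z_2 z_4 + z_3 z_4)\,y_1 - (z_2 z_3 z_4)\,y_0, \]

and

\[ P = (z_2 - z_3)(z_2 - z_4)(z_3 - z_4) = 2\beta\left[(a - \alpha)2b + i\{(a - \alpha)^2 + (\beta^2 - b^2)\}\right], \]

a complex number.

If $z_1 z_2 = r^2$ and $z_3 z_4 = \rho^2$, the symmetric functions of the $z$’s in the above formula may be calculated from

\[ \begin{aligned} z_2 z_3 z_4 &= (a - ib)\rho^2 \\ z_2 z_3 + z_2 z_4 + z_3 z_4 &= 2\alpha(a - ib) + \rho^2 \\ z_2 + z_3 + z_4 &= (a - ib) + 2\alpha \end{aligned} \]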
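A brief sketch may help in checking hand computation of $A_1$. It assumes the relation $A_1 = NP/D$, with $N = y_3 - (z_2 + z_3 + z_4)y_2 + (z_2 z_3 + z_2 z_4 + z_3 z_4)y_1 - z_2 z_3 z_4\,y_0$, $P = (z_2 - z_3)(z_2 - z_4)(z_3 - z_4)$, and $D$ the product of the six differences in VII, and it uses ordinary complex arithmetic rather than the separate real and imaginary parts employed for desk calculation. The function name is hypothetical.

```python
def first_coefficient(y, z):
    """A_1 = N*P/D for ordinates y = (y0, y1, y2, y3) and roots z = (z1, z2, z3, z4)."""
    y0, y1, y2, y3 = y
    z1, z2, z3, z4 = z
    # N uses the elementary symmetric functions of z2, z3, z4.
    N = (y3 - (z2 + z3 + z4) * y2
         + (z2 * z3 + z2 * z4 + z3 * z4) * y1
         - z2 * z3 * z4 * y0)
    P = (z2 - z3) * (z2 - z4) * (z3 - z4)
    D = ((z1 - z2) * (z1 - z3) * (z1 - z4)
         * (z2 - z3) * (z2 - z4) * (z3 - z4))
    return N * P / D
```

Since $D$ contains $P$ as a factor, the quotient is simply $N$ divided by $(z_1 - z_2)(z_1 - z_3)(z_1 - z_4)$, which is the form used for the second case.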
For the second case, which is exemplified by the vapor pressure data,

\[ D = 2ib\left[(a - z_3)^2 + b^2\right]\left[(a - z_4)^2 + b^2\right]\left[z_3 - z_4\right], \tag{VIIb} \]

a pure imaginary. The sum of the two numerators of $A_1$ and $A_2$, namely

\[ (z_1 - z_2)\begin{vmatrix} 0 & y_0 & 1 & 1 \\ 1 & y_1 & z_3 & z_4 \\ z_1 + z_2 & y_2 & z_3^2 & z_4^2 \\ z_1^2 + z_1 z_2 + z_2^2 & y_3 & z_3^3 & z_4^3 \end{vmatrix} \]

is a pure imaginary, since $(z_1 - z_2)$ has this character, and the determinant has nothing but real elements. Hence $A_1$ and $A_2$ are still complex conjugates when $z_3$ and $z_4$ are real, $z_1$ and $z_2$ being complex conjugates.

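For purposes of numerical calculation $A_1$ may be obtained from

\[ A_1 = \frac{N}{(z_1 - z_2)(z_1 - z_3)(z_1 - z_4)}. \]

Here $(z_1 - z_2)$ is a pure imaginary and the other three factors are complex.

Let

\[ \begin{aligned} N &= r_1(\cos\theta_1 + i\sin\theta_1) \\ z_1 - z_3 &= r_2(\cos\theta_2 + i\sin\theta_2) \\ z_1 - z_4 &= r_3(\cos\theta_3 + i\sin\theta_3) \end{aligned} \]

Then

\[ A_1 = \frac{r_1\left[\cos(\theta_1 - \theta_2 - \theta_3) + i\sin(\theta_1 - \theta_2 - \theta_3)\right]}{(z_1 - z_2)\,r_2\,r_3}. \]

In calculating $N$ by the formula given for it in the preceding paragraph, the symmetric functions of the $z$’s were obtained from

\[ \begin{aligned} z_2 z_3 z_4 &= (a - ib)\,z_3 z_4 \\ z_2 z_3 + z_2 z_4 + z_3 z_4 &= (a - ib)(z_3 + z_4) + z_3 z_4 \\ z_2 + z_3 + z_4 &= (a - ib) + z_3 + z_4. \end{aligned} \]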


Example 


The first two of the following tables are abstracted from Table 2, p. 178 of Bureau Standards Research Paper No. 523. The third table is abstracted from Table 3, p. 179 et seq. of that publication. $x$ is the number of degrees centigrade above 100°. $y$ is the pressure of saturated water vapor in International Standard Atmospheres. In the first two of the following tables, the values of $y$ are observed values. In the third, they are interpolated or graduated values calculated at the Bureau of Standards.


TABLE I

      y

    1.0000
   12.3887
   63.3558
  207.771


TABLE II

      y

    1.0000
    4.6969
   15.3472
   39.2566
   84.7969
  163.205


TABLE III

    x        y

    0      1.0000
   39      3.4666
   78      9.4490
  117     21.612
  156     43.392
  195     78.974
  234    133.64
  273    215.37

The observed values of $y$ in Table I are reproduced by the following formula used in conjunction with a standard six place table of logarithms and trigonometric functions:


(I) y = 3.967433 ¢-%595402 eos (.4085758x — 75°24’03’’.7). 


The observed values of y in Table II are reproduced by the following formula 
used in conjunction with a standard six place table of logarithms and trigono- 
metric functions. 


y = 3.0253744 «018186052 
(IT) 


+ 2.2171657 61507162 egs (155°59'35"".5 — 0.7899232z). 
Hence the formula is presumably an excellent one for interpolation between the values of $y$ listed in Table II, if the greatest accuracy is not needed.⁴

The values of y in Table III are reproduced exactly to five significant figures 
by the following formula used in conjunction with a standard six place table 
of logarithms and trigonometric functions. 


y = 3.8902543 ¢:014139202 a 164787 ¢—-02169302 
+ 2.743000 €:998842902 eos (.7860725x2 + 186°28'53’’.2). 


By means of this formula the saturation pressure of water vapor was calculated 
for every five degrees from 100°C to 370°C in order to make comparisons with 
the corresponding “smoothed” values in Table 2 of the Bureau of Standards 
publication referred to above. The discrepancies were never more than one in 
the fourth significant figure and generally less. The poorest agreement was 
in the ranges of temperature from 100°C to 135°C and from 245°C to 270°C. 

It is a pleasure to acknowledge the intelligent and painstaking assistance of 
Mr. G. D. Lambert, undergraduate student at Washington University, for doing 
most of the computing. 


WASHINGTON UNIVERSITY, 
St. Louis, Mo. 


4 The values of y in Table III (not counting the value of y for x = 0) are reproduced 
by it with an average error of .138% and a largest error (for x = 234°) of .80%. Four of 
the errors are negative and three positive. 





ON THE FREQUENCY DISTRIBUTION OF CERTAIN RATIOS 
By H. L. RIETZ


University of Iowa


Considerable interest in the distribution of ratios, $t = y/x$, has no doubt
been suggested by important applications. For example, we may mention the 
opsonic index in bacteriology, the ratio of systolic to diastolic blood pressure 
in physiology, and ratios such as link relatives or certain index numbers in 
economics. 

In 1910, Karl Pearson! gave certain properties of the distribution of ratios 
by means of approximate formulas for moments up to order four in terms of 
means, variances, product moments, and coefficients of variability of x and y. 
The resulting formulas did not give, with sufficient accuracy, the constants of 
the distribution of the opsonic index for the purpose of Dr. Greenwood to whom 
Pearson attributed the derivation of the formulas for the special case in which 
x and y are uncorrelated. Pearson next adopted the plan of tabulating the reciprocals, say $x' = \frac{1}{x}$, and then finding the constants of the distribution of the product $yx'$ in the case in which $x'$ and $y$ are uncorrelated. He then obtained satisfactory results in illustrative examples.

In 1929, C. C. Craig? obtained the semi-invariants of y/x in terms of moments 
of x and y, and then expressed the moments in terms of the semi-invariants of 
the distribution function, $f(x, y)$, of x and y. By this means, he was able to
deal with the case in which x and y are normally correlated under suitable 
conditions. Craig found it desirable to restrict the distribution of x in such a 
way that the probability of a zero value of x is an infinitesimal of sufficiently 
high order that a certain integral exists. This limitation seems to imply in 
applications to actual data that no zero values of x are to occur. This suggests 
that we deal with the cases of x at or near zero with considerable care. 

By starting with the assumption that the values of x and y are a set of normally distributed pairs of values with correlation coefficient $r$, and by considering the quotient $z = \frac{b + y}{a + x}$, $a$ and $b$ being constants, R. C. Geary,³ in a paper published in 1930, found an algebraic function, $u = f(z)$, of fairly simple form with the property that $u$ is nearly normally distributed with arithmetic mean zero and standard deviation unity provided that $a + x$ is unlikely to have negative values. Here we have again a suggestion to exercise special care in the case of quotients with the divisor near zero or negative.

1 On the constants of index distributions, Biometrika, Vol. 7 (1910), pp. 531-546.
2 The frequency function of y/x, Annals of Mathematics, Vol. 30 (1928-29), pp. 471-486.
3 The frequency distribution of the quotient of two normal variates, J. Royal Statistical Society, Vol. XCIII (1930), pp. 442-7.

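In 1932, Fieller⁴ obtained in explicit form the approximate distribution of $t = y/x$ where values $(x, y)$ are drawn from the bivariate normal distribution

\[ \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - r^2}}\; e^{-\frac{1}{2(1 - r^2)}\left\{\frac{(x - \bar x)^2}{\sigma_x^2} + \frac{(y - \bar y)^2}{\sigma_y^2} - \frac{2r(x - \bar x)(y - \bar y)}{\sigma_x\sigma_y}\right\}} \]

under the condition that $\bar x$ is large compared with $\sigma_x$.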


Very recently Kullback⁵ found the distribution law of the quotient, $t = y/x$, where x and y are drawn from Pearson Type III parent populations given by

\[ f_1(x) = \frac{e^{-x}x^{p-1}}{\Gamma(p)}, \qquad f_2(y) = \frac{e^{-y}y^{q-1}}{\Gamma(q)}. \]

It is fairly easy to see, in a general way, that the distribution of $t = y/x$ depends very much on the location of the origin as well as on the parent distribution from which x and y are drawn. This fact will be fairly obvious from the present paper whose main purpose is to give clear geometrical descriptions of the distributions of ratios, $t = y/x$, for each of several cases in which $(x, y)$ are points taken at random from certain simple geometrical figures conveniently located with respect to the origin.

In accord with the suggestions to be cautious when the divisor is near zero or negative, we consider first the very simple case of ratios $t = y/x$ obtained from points uniformly distributed over a rectangle such as is shown in Fig. 1 with sides parallel to coordinate axes and $a_1 > 0$, $b_1 > 0$. As indicated on Fig. 1, we assume for simplicity that the coordinates of the points are positive and $a_1 \le x \le a_2$, $b_1 \le y \le b_2$.

4 E. C. Fieller, The distribution of the index in a normal bivariate population, Biometrika, Vol. 24 (1932), pp. 428-440.
5 Solomon Kullback, Annals of Mathematical Statistics, Vol. VII (1936), pp. 51-53.

Case I. When $\frac{b_1}{a_1} < \frac{b_2}{a_2}$.
Let $k\,dx\,dy$ be the probability that a point $(x, y)$ taken at random in the rectangle will fall into $dx\,dy$ where $k$ is a constant. Then

\[ k\int_{b_1}^{b_2}\!\int_{a_1}^{a_2} dx\,dy = k(a_2 - a_1)(b_2 - b_1) = 1, \]

and

\[ k = \frac{1}{(a_2 - a_1)(b_2 - b_1)}. \]

Transform the element $k\,dx\,dy$ into one with variables $t$ and $x$ by making

\[ x = x, \qquad y = tx. \]

The Jacobian is $|x| = x$.

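The new element is $k\,x\,dx\,dt$ and is to be integrated over the range on $x$ for an assigned $t$ in order to get the probability, to within infinitesimals of higher order, that a random $t$ falls into an assigned $dt$. By assigning $t$ any value such that $\frac{b_1}{a_2} \le t \le \frac{b_1}{a_1}$, say $t$ is the slope of $MN$ (Fig. 1), we have

\[ k\int_{b_1/t}^{a_2} x\,dx\,dt = \frac{k}{2}\left(a_2^2 - \frac{b_1^2}{t^2}\right)dt, \tag{1} \]

the limits of integration being indicated by the ends of the line $MN$.

When the assigned $t$ is such that $\frac{b_1}{a_1} \le t \le \frac{b_2}{a_2}$, say $t$ is the slope of the line $M'N'$, we have

\[ k\int_{a_1}^{a_2} x\,dx\,dt = \frac{k}{2}\left(a_2^2 - a_1^2\right)dt. \tag{2} \]

When the assigned $t$ is such that $\frac{b_2}{a_2} \le t \le \frac{b_2}{a_1}$, say it is the slope of $M''N''$, we have

\[ k\int_{a_1}^{b_2/t} x\,dx\,dt = \frac{k}{2}\left(\frac{b_2^2}{t^2} - a_1^2\right)dt. \tag{3} \]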
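Thus, from (1), (2), (3), when as in Fig. 1, $\frac{b_1}{a_1} < \frac{b_2}{a_2}$, the frequency function of $t$ is given by

\[ F(t) = \frac{k}{2}\left(a_2^2 - \frac{b_1^2}{t^2}\right) \quad\text{when}\quad \frac{b_1}{a_2} \le t \le \frac{b_1}{a_1}, \tag{4} \]

\[ F(t) = \frac{k}{2}\left(a_2^2 - a_1^2\right) \quad\text{when}\quad \frac{b_1}{a_1} \le t \le \frac{b_2}{a_2}, \tag{5} \]

\[ F(t) = \frac{k}{2}\left(\frac{b_2^2}{t^2} - a_1^2\right) \quad\text{when}\quad \frac{b_2}{a_2} \le t \le \frac{b_2}{a_1}. \tag{6} \]

See Fig. 2 for the general form of the frequency curve $F(t)$ when $\frac{b_1}{a_1} < \frac{b_2}{a_2}$, with the segment from $t = \frac{b_1}{a_1}$ to $t = \frac{b_2}{a_2}$ a horizontal straight line and with discontinuities in the first derivatives of $F(t)$ at $t = \frac{b_1}{a_1}$ and $t = \frac{b_2}{a_2}$.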


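[Figure 2: general form of the frequency curve F(t)]

FIG. 2


When $a_1 \to 0$, and $b_1 = 0$, the frequency curve approaches

\[ F(t) = \frac{a_2}{2b_2} \quad\text{when}\quad 0 \le t \le \frac{b_2}{a_2}, \tag{7} \]

\[ F(t) = \frac{b_2}{2a_2 t^2} \quad\text{when}\quad t \ge \frac{b_2}{a_2}. \tag{8} \]

It may be noted that the curve given by making $a_1 = 0$ and $b_1 = 0$ extends to infinity, and that the first and second moments about the origin are each infinite.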
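Case II. When $\frac{b_1}{a_1} > \frac{b_2}{a_2}$.

If the rectangle in Fig. 1 were moved upward keeping its sides parallel to the $x$ and $y$ axes until $\frac{b_1}{a_1} > \frac{b_2}{a_2}$, we would obtain

\[ F(t) = \frac{k}{2}\left(a_2^2 - \frac{b_1^2}{t^2}\right) \quad\text{if}\quad \frac{b_1}{a_2} \le t \le \frac{b_2}{a_2}, \tag{9} \]

\[ F(t) = \frac{k}{2t^2}\left(b_2^2 - b_1^2\right) \quad\text{if}\quad \frac{b_2}{a_2} \le t \le \frac{b_1}{a_1}, \tag{10} \]

\[ F(t) = \frac{k}{2}\left(\frac{b_2^2}{t^2} - a_1^2\right) \quad\text{if}\quad \frac{b_1}{a_1} \le t \le \frac{b_2}{a_1}. \tag{11} \]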


By comparing (5) and (10), it may be observed that F(t) of the middle seg- 
ment of the distribution curve differs much in Case II from its corresponding 
constant value in Case I. 
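Both cases may be gathered into a single rule, sketched here as an illustration: for an assigned slope $t$ the admissible values of $x$ run from the larger of $a_1$ and $b_1/t$ to the smaller of $a_2$ and $b_2/t$, and $F(t)$ is $k/2$ times the difference of the squares of these limits. The sketch assumes points uniformly distributed over the rectangle $a_1 \le x \le a_2$, $b_1 \le y \le b_2$ with $a_1 > 0$ and $b_1 \ge 0$; the function name is hypothetical.

```python
def ratio_density_rectangle(t, a1, a2, b1, b2):
    """Frequency function F(t) of t = y/x for (x, y) uniform on the rectangle."""
    if t <= 0:
        return 0.0
    k = 1.0 / ((a2 - a1) * (b2 - b1))
    lower = max(a1, b1 / t)      # left end of the chord of slope t
    upper = min(a2, b2 / t)      # right end of the chord of slope t
    if upper <= lower:
        return 0.0               # the line y = t*x misses the rectangle
    return 0.5 * k * (upper ** 2 - lower ** 2)
```

With $a_1 = 1$, $a_2 = 2$, $b_1 = 1$, $b_2 = 3$ (Case I) the middle segment has the constant height $\frac{k}{2}(a_2^2 - a_1^2) = 0.75$; with $b_1 = 3$, $b_2 = 4$ (Case II) the middle segment falls off as $1/t^2$.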

By moving the rectangle of Fig. 1 downward, keeping its sides parallel to the $x$ and $y$ axes until $b_1$ is negative, we easily find further forms of the distribution curve $F(t)$.

To consider the distribution of the ratio $t = y/x$ for another very simple type of distribution of x and y, suppose we have given the distribution function

\[ f(x, y) = k\,e^{-\frac{x}{a} - \frac{y}{b}}, \qquad (x \ge c > 0,\; y \text{ non-negative}) \tag{12} \]

where $\int_0^\infty\!\int_c^\infty f(x, y)\,dx\,dy = 1$. Then

\[ k = \frac{e^{c/a}}{ab}. \]

In this case,

\[ F(t) = \frac{e^{c/a}}{ab}\int_c^\infty x\,e^{-\frac{x}{a} - \frac{tx}{b}}\,dx = \left(c + \frac{ab}{b + at}\right)\frac{e^{-\frac{ct}{b}}}{b + at}, \tag{13} \]

a monotone decreasing function from $t = 0$ to $t = \infty$.

With $c = 0$ as a limiting value, we obtain

\[ F(t) = \frac{ab}{(b + at)^2}, \tag{14} \]

a distribution curve with the mean value of $t$ at infinity.
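The closed form for this case may be tabulated as sketched below. The sketch assumes that (13) reads $F(t) = \left(c + \frac{ab}{b+at}\right)\frac{e^{-ct/b}}{b+at}$, which follows from integrating $x\,f(x, tx)$ over $x \ge c$ with $f(x, y) = \frac{e^{c/a}}{ab}e^{-x/a - y/b}$, and that (14) is its limit for $c = 0$. The function name is hypothetical.

```python
import math


def ratio_density_exponential(t, a, b, c):
    """F(t) for t = y/x with f(x, y) proportional to exp(-x/a - y/b), x >= c, y >= 0."""
    return (c + a * b / (b + a * t)) * math.exp(-c * t / b) / (b + a * t)
```

For $c > 0$ the exponential factor makes the curve fall off rapidly, whereas for $c = 0$ it decays only as $1/t^2$, which is why the mean of $t$ is then infinite.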
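If we should similarly consider

\[ f(x, y) = \frac{2}{\pi\sigma_x\sigma_y}\; e^{-\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2}}, \qquad (x \text{ and } y \text{ non-negative}) \tag{15} \]

we easily obtain

\[ F(t) = \frac{2}{\pi\sigma_x\sigma_y\left(\frac{1}{\sigma_x^2} + \frac{t^2}{\sigma_y^2}\right)} \tag{16} \]

as the distribution function.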
Although the difficulties⁶ of the problem of the distribution of the ratio $y/x$ when x and y are normally correlated have been overcome⁷ to a considerable extent, still the examination of some very simple geometric cases of non-normal but linear correlation may not be without some interest. Such a case will now be considered.

[Figure 3: the parallelogram ABCD with the regression line RS]

FIG. 3


For one very simple case in which x and y are correlated, suppose we are given a set of points $(x, y)$ uniformly distributed over the parallelogram $ABCD$ (Fig. 3) with sides $AD$ and $BC$ parallel to the $y$-axis so that the regression of $y$ on $x$ is linear as shown by the line $RS$.

The equation of $RS$ is

\[ y = m(x - a_1) + \frac{b_1 + b_2}{2}; \tag{17} \]

6 Loc. cit., Pearson, p. 531.
7 Loc. cit., C. C. Craig, R. C. Geary, E. C. Fieller.
Then although $x_i$ and $y_i$ are correlated, $x_i$ and

\[ y_i' = y_i - m(x_i - a_1) - \frac{b_1 + b_2}{2} \]

are uncorrelated. Let us consider the distribution of the ratio $t' = \frac{y'}{x}$. Consider the element of frequency $k\,dx\,dy'$, where

\[ k(b_2 - b_1)(a_2 - a_1) = 1. \tag{18} \]

Change variables to $x$ and $t'$ by the transformation

\[ x = x, \qquad y' = t'x. \]

Then the element of frequency becomes

\[ k\,x\,dx\,dt'. \tag{19} \]


Next integrate (19) with respect to x under the restriction that t' is assigned.
Three cases occur:

(a) When $-\frac{b_2-b_1}{2a_2} \le t' \le \frac{b_2-b_1}{2a_2}$, we obtain by integration of (19) for the
element of relative frequency of t' in dt',

(20)    $k\int_{a_1}^{a_2} x\,dx\,dt' = \frac{k}{2}(a_2^2 - a_1^2)\,dt'.$

(b) When $t' \ge \frac{b_2-b_1}{2a_2}$, we obtain

(21)    $k\int_{a_1}^{\frac{b_2-b_1}{2t'}} x\,dx\,dt' = \frac{k}{2}\left[\frac{(b_2-b_1)^2}{4t'^2} - a_1^2\right]dt'.$

(c) When $t' \le -\frac{b_2-b_1}{2a_2}$, we similarly obtain

(22)    $k\int_{a_1}^{-\frac{b_2-b_1}{2t'}} x\,dx\,dt' = \frac{k}{2}\left[\frac{(b_2-b_1)^2}{4t'^2} - a_1^2\right]dt'.$

From (18), (19), (20), (21) and (22), the frequency function of t' is given by

(23)    $F(t') = \frac{a_1 + a_2}{2(b_2 - b_1)}$  when  $-\frac{b_2-b_1}{2a_2} \le t' \le \frac{b_2-b_1}{2a_2}$;

(24)    $F(t') = \frac{1}{2(b_2-b_1)(a_2-a_1)}\left[\frac{(b_2-b_1)^2}{4t'^2} - a_1^2\right],$

where the range of t' is subject to either of the inequalities,

        $\frac{b_2-b_1}{2a_2} < t' < \frac{b_2-b_1}{2a_1}$  or  $-\frac{b_2-b_1}{2a_1} < t' < -\frac{b_2-b_1}{2a_2}.$

See Fig. 4 for the general form of the F(t') frequency curve.
If we make $a_1 = 0$, the curve becomes infinite in range. If we make not only
$a_1 = 0$, but $(b_1 + b_2)/2 = 0$, we have, in place of (17),

        $y = mx.$

[Fig. 4: general form of the frequency curve F(t')]

In this limiting situation, if we make $a_2 = a$ and $\frac{b_2 - b_1}{2} = b$, (23) becomes

(25)    $F(t') = \frac{a}{4b}$  for  $-\frac{b}{a} \le t' \le \frac{b}{a}$,

and (24) becomes

(26)    $F(t') = \frac{b}{4at'^2}$  for  $t' \ge \frac{b}{a}$  and for  $t' \le -\frac{b}{a}$.


Then we have $y' = y - mx$ and

        $t' = \frac{y'}{x} = t - m.$

Further, if t' is distributed in accord with a frequency function, F(t'), the
distribution of t = t' + m with m constant is given by

        $F(t - m).$

Hence, the probability that a random value t will fall into a range t to t + dt
is given to within infinitesimals of higher order by

(27)    $\frac{a}{4b}\,dt$  when  $m - \frac{b}{a} \le t \le m + \frac{b}{a}$,

and by

(28)    $\frac{b}{4a(t-m)^2}\,dt$  when  $t \le m - \frac{b}{a}$  or  $t \ge m + \frac{b}{a}$.



With the frequency curve given by (27) and (28) we may note that the variance 
of t becomes infinite. 
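The limiting case behind (25) and (26) can be checked by simulation. The sketch below is only an illustration under stated assumptions: it takes x uniform on (0, a) and y' uniform on (-b, b), which is the limiting parallelogram with $a_1 = 0$, $a_2 = a$ and $(b_2 - b_1)/2 = b$, and compares two empirical probabilities with the values implied by (25) and (26). The function name, the parameter values and the seed are hypothetical.

```python
import random

def simulate_ratio(a, b, n, seed=0):
    """Draw n values of t' = y'/x with x uniform on (0, a], y' uniform on [-b, b)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n):
        x = a * (1.0 - rng.random())              # never exactly zero
        y_prime = b * (2.0 * rng.random() - 1.0)
        ratios.append(y_prime / x)
    return ratios

# Illustrative (assumed) parameter values.
a, b, n = 2.0, 3.0, 200_000
ratios = simulate_ratio(a, b, n)

# From (25): P(|t'| <= b/a) = (a / 4b) * (2b / a) = 1/2.
central = sum(abs(t) <= b / a for t in ratios) / n
# From (26): P(|t'| > T) = b / (2 a T) for T >= b/a; with T = 4b/a this is 1/8.
tail = sum(abs(t) > 4 * b / a for t in ratios) / n
print(central, tail)
```

The slowly decaying tails of order $1/t'^2$ are what make the variance of t infinite.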

Without taking the space to continue illustrations, it is fairly obvious that a
wide diversity of form can be given to the frequency function of the quotients
t = y/x by relatively simple changes in the location of a sample parent
population with reference to the origin.





EDITORIAL 


THE FUNDAMENTAL NATURE AND PROOF OF SHEPPARD’S 
ADJUSTMENTS 


In the course of our discussion of moment adjustments, we shall have occasion 
to refer to the following lengthy distribution of discrete variates. By selecting 


TABLE 1
Distribution of the number of items correctly recorded by 244 students in a five
minute code transcription test*

Score x (legible entries, in source order): 64; 84 to 93; 94 to 118; 119 to 128;
130 to 134; 136; 138; 140; 141; 142; 144; 153; 155.

Freq. f (the one legible column, in source order): 1, 2, 2, 1, 1, 3, 3, 3, 1, 2, 3,
1, 2, 2, 3, 2, 6, 3, 1, 2, 4, 4, 5, 2, [illegible].
[The remaining score and frequency entries are illegible in this copy.]

Total: 244

*I am indebted to Professor J. A. Gengerelli, of the Department of Psychology of the
Univ. of California at Los Angeles, for these data.
the provisional mean, $M_0 = 105$, we find that

        $\Sigma xf = -129$            $\Sigma x^3 f = -52\,005$
        $\Sigma x^2 f = 77\,591$      $\Sigma x^4 f = 69\,239\,951.$

Let us now form the nine possible distributions of grouped-discrete variates
that arise from the nine possible "groupings of nine." These are presented
in table 2.

TABLE 2
Distributions derived from the data of table 1 by making the nine possible
"groupings of nine"

First significant class interval of distribution

  (2)      (3)      (4)      (5)      (6)      (7)
 63-71    62-70    61-69    60-68    59-67    58-66


mw tt 2 yt et ets 
is | 16 | i | 14 (| (4 | (18 
Zs i niemsastw@tini| « 
41 | 33 | 32 
54 | 63 | 641 52 | 49 
45 | 40 | 38 | | 44 
27 | 29 | 36 | 40 
19 | 24 | 5 | : | | 28 

| Fy i | 12 


30 | 28 | 3i 










Let us now compute the values of $\Sigma xf$, $\Sigma x^2 f$, $\Sigma x^3 f$ and $\Sigma x^4 f$ for each of the
distributions of table 2, selecting $M_0 = 105$ in each instance in order to facilitate
a comparison of these results with those for table 1. Thus, in spite of what
would otherwise be called poor computing technique, we shall use the following
class marks as values of x for the first distribution above: -37, -28, -19, ...,
35, 44, 53. For the second we shall likewise use -38, -29, -20, ..., 34, 43,
52, respectively.


TABLE 3
Summations derived from the distributions listed in table 2, using $M_0 = 105$

Dist.       Σxf       Σx²f          Σx³f        Σx⁴f
(1)        -181     77 149      -134 191     69 265 [illegible]
(2)        -218     78 466       -54 602     74 519 962
(3)        -111     77 769         2 889     71 465 409
(4)        -139     79 747       -23 311     74 171 [illegible]
(5)        -104     81 934        19 666     76 143 874
(6)         -87     80 145        16 551     72 467 541
(7)         -52     80 302       -36 118     71 851 930
(8)         -89     78 553      -101 357     68 426 497
(9)        -180     78 894      -180 792     73 155 150

Average    -129     79 217 2/3   -54 585     72 362 785 2/3








The fact that the average of the values of $\Sigma xf$ appearing in table 3 agrees with
the value obtained from table 1 suggests that no adjustment of the first moment
is necessary and that the variations in the nine values for $\Sigma xf$ may be regarded
as accidental errors and attributed to grouping. An attempt to account for this
phenomenon, and also for the fact that the averages of the higher order summations
of table 3 do not likewise agree with the corresponding summations of table 1,
leads us directly to formulae for Sheppard's adjustments.

For the moment, let us concentrate our attention upon a single variate, $x_0$,
and its associated frequency, $f_{x_0}$, that are a part of a distribution of discrete
variates, such as table 1. Suppose we were to form the k different distributions
arising from the k possible "groupings of k." In one of these distributions,
$x_0$ will rest in the first position of a class interval: the limits of this class are $x_0$
and $(x_0 + k - 1)$ and the class mark is therefore $[x_0 + \tfrac12(k - 1)]$. The
contribution of the variate, $x_0$, to $\Sigma x^s f$ for this particular distribution is therefore

        $[x_0 + \tfrac12(k - 1)]^s \cdot f_{x_0}.$

If $x_0$ rests in the second position of a class, the limits of this class will be
$(x_0 - 1)$ and $(x_0 + k - 2)$ and the corresponding class mark is $[x_0 + \tfrac12(k - 3)]$
and the contribution of $x_0$ to $\Sigma x^s f$ for this distribution is

        $[x_0 + \tfrac12(k - 3)]^s \cdot f_{x_0}.$


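The expected value of $\Sigma x^s f$ arising from the k different groupings of variates is
therefore,

(1)    $E(\Sigma x^s f) = \frac{1}{k}\left[\Sigma_1 x^s f + \Sigma_2 x^s f + \cdots + \Sigma_k x^s f\right],$

where $\Sigma_i x^s f$ refers to that distribution in which a specified x rests in the i-th
position in the class in which it occurs. The contribution of $x_0$ to this expected
value is therefore

(2)    $\frac{1}{k}\left\{[x_0 + \tfrac12(k-1)]^s + [x_0 + \tfrac12(k-3)]^s + [x_0 + \tfrac12(k-5)]^s + \cdots\right\} f_{x_0},$

this series consisting obviously of k terms.
Expanding each term of (2) by the binomial theorem (the terms being taken in
reverse order) yields

        $x_0^s - {}_sC_1 x_0^{s-1}\left(\frac{k-1}{2}\right) + {}_sC_2 x_0^{s-2}\left(\frac{k-1}{2}\right)^2 - {}_sC_3 x_0^{s-3}\left(\frac{k-1}{2}\right)^3 + \cdots$

        $x_0^s - {}_sC_1 x_0^{s-1}\left(\frac{k-3}{2}\right) + {}_sC_2 x_0^{s-2}\left(\frac{k-3}{2}\right)^2 - {}_sC_3 x_0^{s-3}\left(\frac{k-3}{2}\right)^3 + \cdots$

        $x_0^s - {}_sC_1 x_0^{s-1}\left(\frac{k-5}{2}\right) + {}_sC_2 x_0^{s-2}\left(\frac{k-5}{2}\right)^2 - {}_sC_3 x_0^{s-3}\left(\frac{k-5}{2}\right)^3 + \cdots$

        etc.

Since s is an integer, series (2) may be written as the sum of the (s + 1) terms
of the series

(3)    $\left[x_0^s S_0 - {}_sC_1 x_0^{s-1} S_1 + {}_sC_2 x_0^{s-2} S_2 - {}_sC_3 x_0^{s-3} S_3 + \cdots\right] f_{x_0},$

in which

        $S_i = \frac{1}{k}\left\{\left(\frac{k-1}{2}\right)^i + \left(\frac{k-3}{2}\right)^i + \left(\frac{k-5}{2}\right)^i + \cdots \text{ to } k \text{ terms}\right\}.$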


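By the Euler-Maclaurin Sum Formula we have

        $\sum_{x=a}^{b} x^p = \frac{1}{p+1}\left(b^{p+1} - a^{p+1}\right) + \frac12\left(b^p + a^p\right) + \frac{B_1}{2!}\,p\left(b^{p-1} - a^{p-1}\right)$

        $\qquad\qquad - \frac{B_2}{4!}\,p^{(3)}\left(b^{p-3} - a^{p-3}\right) + \frac{B_3}{6!}\,p^{(5)}\left(b^{p-5} - a^{p-5}\right) - \cdots,$

where $B_1 = \frac16$, $B_2 = \frac1{30}$, $B_3 = \frac1{42}, \ldots$ are the Bernoulli numbers and
$p^{(i)} = p(p-1)(p-2)(p-3)\cdots$ to i factors. In our expression for
$S_i$, $a = -\tfrac12(k-1) = -b$, and therefore $S_i$ equals zero when i is an odd integer.
For even values of i,

(4)    $S_i = \frac{1}{k}\left[\frac{(k-1)^i(k+i)}{2^i(i+1)} + 2\,\frac{B_1}{2!}\,i\left(\frac{k-1}{2}\right)^{i-1} - 2\,\frac{B_2}{4!}\,i^{(3)}\left(\frac{k-1}{2}\right)^{i-3} + 2\,\frac{B_3}{6!}\,i^{(5)}\left(\frac{k-1}{2}\right)^{i-5} - \cdots\right]$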
so that

        $S_0 = 1$

        $S_2 = \frac{1}{12}(k^2 - 1)$

        $S_4 = \frac{1}{240}(k^2 - 1)(3k^2 - 7)$

        $S_6 = \frac{1}{1344}(k^2 - 1)(3k^4 - 18k^2 + 31)$

        etc.


Since expression (3) represents the contribution of any variate, $x_0$, to the
expected value defined by (1), we may obtain by summation

(5)    $E(\Sigma x^s f) = \Sigma x^s f + {}_sC_2 \cdot S_2 \cdot \Sigma x^{s-2} f + {}_sC_4 \cdot S_4 \cdot \Sigma x^{s-4} f + \cdots.$
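Formula (5) can be checked directly by forming all k groupings of a small distribution and averaging the resulting power sums. The sketch below does this with exact rational arithmetic. The frequency table `freq`, the class width `k = 4` and the origin are hypothetical values chosen for illustration (they are not the data of table 1), and the function names are likewise assumptions.

```python
from fractions import Fraction
from math import comb

def power_sum(freq, s, origin=0):
    """Sum of (x - origin)^s * f over the ungrouped distribution."""
    return sum(Fraction(x - origin) ** s * f for x, f in freq.items())

def grouped_power_sum(freq, s, k, offset, origin=0):
    """Same sum after grouping into classes of width k whose first
    positions are the integers congruent to offset modulo k."""
    half = Fraction(k - 1, 2)
    total = Fraction(0)
    for x, f in freq.items():
        position = (x - offset) % k          # 0 means first position in class
        class_mark = x - position + half
        total += (class_mark - origin) ** s * f
    return total

def S(i, k):
    """S_i: mean i-th power of the k displacements (k-1)/2, (k-3)/2, ..."""
    half = Fraction(k - 1, 2)
    return sum((half - j) ** i for j in range(k)) / k

def expected_by_formula_5(freq, s, k, origin=0):
    """Right-hand side of (5); only even-order S_i contribute."""
    return sum(comb(s, i) * S(i, k) * power_sum(freq, s - i, origin)
               for i in range(0, s + 1, 2))

freq = {3: 2, 4: 5, 6: 1, 7: 4, 9: 3, 12: 2}   # hypothetical data
k, origin = 4, 6
averages = {s: sum(grouped_power_sum(freq, s, k, r, origin) for r in range(k)) / k
            for s in range(1, 5)}
predicted = {s: expected_by_formula_5(freq, s, k, origin) for s in range(1, 5)}
print(averages == predicted)
```

Exact fractions are used so that the agreement between the averaged grouped sums and formula (5) is an identity rather than a floating-point approximation.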


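To illustrate: if we desire to shorten the distribution of table 1 by forming class
intervals of dimension 9,

        $S_2 = \frac{1}{12}(9^2 - 1) = \frac{20}{3}, \qquad S_4 = \frac{1}{240}(9^2 - 1)(3\cdot 9^2 - 7) = \frac{236}{3},$

and by formula (5),

        $E(\Sigma xf) = \Sigma xf = -129$

        $E(\Sigma x^2 f) = \Sigma x^2 f + {}_2C_2 \cdot S_2 \cdot \Sigma f = 77591 + \frac{20}{3}\cdot 244 = 79217\tfrac{2}{3}$

        $E(\Sigma x^3 f) = \Sigma x^3 f + {}_3C_2 \cdot S_2 \cdot \Sigma xf = -52005 + 3\cdot\frac{20}{3}(-129) = -54585$

        $E(\Sigma x^4 f) = \Sigma x^4 f + {}_4C_2 \cdot S_2 \cdot \Sigma x^2 f + {}_4C_4 \cdot S_4 \cdot \Sigma f$

        $\qquad\qquad = 69239951 + 6\cdot\frac{20}{3}\cdot 77591 + \frac{236}{3}\cdot 244 = 72362785\tfrac{2}{3}.$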


Since these expected values are identical with those computed directly in table 3, 
we see that formula (5) provides the adjustments necessary to eliminate the 
effect of the systematic errors caused by grouping. 
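The arithmetic of this illustration can be reproduced exactly. The following sketch applies formula (5) to the four summations of table 1 with k = 9, using the closed forms for $S_2$ and $S_4$ given earlier; the variable and function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

# Summations of table 1 about the provisional mean M0 = 105 (index = power of x).
sums = {0: 244, 1: -129, 2: 77591, 3: -52005, 4: 69239951}
k = 9

# Closed forms for S_0, S_2, S_4; odd-order S_i vanish.
S = {
    0: Fraction(1),
    2: Fraction(k * k - 1, 12),
    4: Fraction((k * k - 1) * (3 * k * k - 7), 240),
}

def expected_sum(s):
    """Formula (5): expected value of the s-th power sum over the k groupings."""
    return sum(comb(s, i) * S[i] * sums[s - i] for i in range(0, s + 1, 2))

expected = {s: expected_sum(s) for s in range(1, 5)}
for s, value in expected.items():
    print(s, value, float(value))
```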
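Dividing both sides of (5) by $\Sigma f$ yields

(6)    $E(\mu'_s) = \mu'_s + {}_sC_2 \cdot S_2 \cdot \mu'_{s-2} + {}_sC_4 \cdot S_4 \cdot \mu'_{s-4} + {}_sC_6 \cdot S_6 \cdot \mu'_{s-6} + \cdots;$

that is

        $E(\mu'_1) = \mu'_1$

        $E(\mu'_2) = \mu'_2 + \frac{1}{12}(k^2 - 1)$

        $E(\mu'_3) = \mu'_3 + \frac{3}{12}(k^2 - 1)\,\mu'_1$

        $E(\mu'_4) = \mu'_4 + \frac{6}{12}(k^2 - 1)\,\mu'_2 + \frac{1}{240}(k^2 - 1)(3k^2 - 7)$

        $E(\mu'_5) = \mu'_5 + \frac{10}{12}(k^2 - 1)\,\mu'_3 + \frac{5}{240}(k^2 - 1)(3k^2 - 7)\,\mu'_1$

        etc.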


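In numerical computations we generally prefer to select the class interval
as the unit of x and in this case we have

        $E(\mu'_1) = \mu'_1$

        $E(\mu'_2) = \mu'_2 + \frac{1}{12}\left(1 - \frac{1}{k^2}\right)$

        $E(\mu'_3) = \mu'_3 + \frac{3}{12}\left(1 - \frac{1}{k^2}\right)\mu'_1$

        $E(\mu'_4) = \mu'_4 + \frac{6}{12}\left(1 - \frac{1}{k^2}\right)\mu'_2 + \frac{1}{240}\left(1 - \frac{1}{k^2}\right)\left(3 - \frac{7}{k^2}\right)$

        etc.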


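Ordinarily we are interested in estimating the values of the moments that
would have been obtained if we had not used the time-saving device of grouping
the variates and therefore we solve the previous set of equations for the moments
of the ungrouped distribution and obtain

(7)     $\mu'_1 = E(\mu'_1)$

        $\mu'_2 = E(\mu'_2) - \frac{1}{12}\left(1 - \frac{1}{k^2}\right)$

        $\mu'_3 = E(\mu'_3) - \frac{3}{12}\left(1 - \frac{1}{k^2}\right)E(\mu'_1)$

        $\mu'_4 = E(\mu'_4) - \frac{6}{12}\left(1 - \frac{1}{k^2}\right)E(\mu'_2) + \frac{1}{240}\left(1 - \frac{1}{k^2}\right)\left(7 - \frac{3}{k^2}\right)$

        etc.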
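The first four adjustments about a fixed point can be sketched as a pair of functions, one for the forward relations (expected grouped moments from ungrouped moments, class interval as unit) and one for the inverse relations (7). This is a minimal sketch; the function names are assumptions, and the example moments are simply the table 1 summations about $M_0 = 105$ rescaled to class-interval units with k = 9.

```python
from fractions import Fraction

def expected_grouped_moments(m1, m2, m3, m4, k):
    """Forward relations: E(mu'_s) from the ungrouped mu'_s (unit = class interval)."""
    c = 1 - Fraction(1, k * k)
    e1 = m1
    e2 = m2 + c / 12
    e3 = m3 + c / 4 * m1
    e4 = m4 + c / 2 * m2 + c * (3 - Fraction(7, k * k)) / 240
    return e1, e2, e3, e4

def sheppard_adjust_discrete(e1, e2, e3, e4, k):
    """Inverse relations (7): ungrouped mu'_s from the expected grouped moments."""
    c = 1 - Fraction(1, k * k)
    m1 = e1
    m2 = e2 - c / 12
    m3 = e3 - c / 4 * e1
    m4 = e4 - c / 2 * e2 + c * (7 - Fraction(3, k * k)) / 240
    return m1, m2, m3, m4

# Moments of table 1 about M0 = 105, expressed in units of the class interval k = 9.
k, N = 9, 244
sums = {1: -129, 2: 77591, 3: -52005, 4: 69239951}
ungrouped = tuple(Fraction(sums[s], N * k ** s) for s in range(1, 5))

grouped = expected_grouped_moments(*ungrouped, k)
recovered = sheppard_adjust_discrete(*grouped, k)
print(recovered == ungrouped)
```

In practice only a single grouped distribution is available, so the moments computed from it would be used in place of the expectations.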





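In general we may write, corresponding to formula (6),

(8)    $\mu'_s = E(\mu'_s) - {}_sC_2 \cdot P_2 \cdot E(\mu'_{s-2}) + {}_sC_4 \cdot P_4 \cdot E(\mu'_{s-4}) - \cdots$

where

        $P_2 = \frac{1}{12}\left(1 - \frac{1}{k^2}\right)$

        $P_4 = \frac{1}{240}\left(1 - \frac{1}{k^2}\right)\left(7 - \frac{3}{k^2}\right)$

        $P_6 = \frac{1}{1344}\left(1 - \frac{1}{k^2}\right)\left(31 - \frac{18}{k^2} + \frac{3}{k^4}\right)$

        $P_8 = \frac{1}{11520}\left(1 - \frac{1}{k^2}\right)\left(381 - \frac{239}{k^2} + \frac{55}{k^4} - \frac{5}{k^6}\right)$

        $P_{10} = \frac{1}{33792}\left(1 - \frac{1}{k^2}\right)\left(2555 - \frac{1636}{k^2} + \frac{410}{k^4} - \frac{52}{k^6} + \frac{3}{k^8}\right)$

        $P_{12} = \frac{1}{5591040}\left(1 - \frac{1}{k^2}\right)\left(1414477 - \frac{910573}{k^2} + \frac{233570}{k^4} - \frac{32410}{k^6} + \frac{2625}{k^8} - \frac{105}{k^{10}}\right).$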
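The coefficients $P_2, P_4, \ldots$ can be generated numerically from the $S_i$, since (8) must undo (6). Substituting one into the other gives, for every n ≥ 1, the relation $\sum_{j=0}^{n} (-1)^j\, {}_{2n}C_{2j}\, P_{2j}\, S_{2n-2j} = 0$ with $P_0 = S_0 = 1$ and the $S_i$ expressed in class-interval units. The sketch below uses this recurrence (a derivation supplied here for illustration, not a step stated in the text) and compares the results with the closed forms listed above; all function names are assumptions.

```python
from fractions import Fraction
from math import comb

def s_unit(i, k):
    """S_i in class-interval units: mean i-th power of the k displacements / k."""
    half = Fraction(k - 1, 2)
    return sum(((half - j) / k) ** i for j in range(k)) / k

def p_by_recurrence(k, n_max):
    """P_2 ... P_{2 n_max} from sum_j (-1)^j C(2n, 2j) P_{2j} S_{2n-2j} = 0."""
    P = {0: Fraction(1)}
    for n in range(1, n_max + 1):
        acc = sum((-1) ** j * comb(2 * n, 2 * j) * P[2 * j] * s_unit(2 * n - 2 * j, k)
                  for j in range(n))
        P[2 * n] = (-1) ** (n + 1) * acc
    return P

def p_closed_form(k):
    """Closed forms for P_2 ... P_12 as listed with formula (8)."""
    q = Fraction(1, k * k)
    c = 1 - q
    return {
        2: c / 12,
        4: c * (7 - 3 * q) / 240,
        6: c * (31 - 18 * q + 3 * q**2) / 1344,
        8: c * (381 - 239 * q + 55 * q**2 - 5 * q**3) / 11520,
        10: c * (2555 - 1636 * q + 410 * q**2 - 52 * q**3 + 3 * q**4) / 33792,
        12: c * (1414477 - 910573 * q + 233570 * q**2 - 32410 * q**3
                 + 2625 * q**4 - 105 * q**5) / 5591040,
    }

agreement = {k: all(p_by_recurrence(k, 6)[n] == p_closed_form(k)[n]
                    for n in range(2, 13, 2))
             for k in (2, 3, 9)}
print(agreement)
```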


In actual problems we do not know the exact values of the expectations
involved in formulae (7) and (8), and are forced to obtain mere approximations 
by utilizing in their stead the corresponding moments computed from the 
single chance grouped distribution. These approximations correspond to those 
employed in the theory of probable error, namely, substitutions of the moments 
derived from a single sample for the corresponding expected moments of the 
parent population. 

The adjustments so far considered may properly be referred to as Sheppard's
adjustments about a fixed point. At first thought it might appear that we might
obtain corresponding formulae for the expectations of moments about the mean
by merely dropping the primes in formula (6) and obtain, for example,

        $\mu_2 = E(\mu_2) - \frac{1}{12}(k^2 - 1),$

but unfortunately this is not true. For example, the exact value for the variance
of the distribution of table 1 is $18915563/244^2$. Using the summations of
table 3 and computing the variance for each of the nine groupings yields

(9)    $E(\mu_2) = \frac{1}{9\cdot 244^2}\,[18791595 + 19098180 + 18963315 + 19438947$

        $\qquad + 19981080 + 19547811 + 19590984 + 19159011 + 19217736]$

        $\qquad = 19309851/244^2.$

Since $\frac{1}{12}(k^2 - 1) = \frac{1}{12}(9^2 - 1) = 20/3$ we see that

        $\mu_2 \ne E(\mu_2) - \frac{1}{12}(k^2 - 1).$


In the theory of sampling we differentiate between the standard errors of 
moments about a fixed point and the standard error of moments about the mean 




of the sample. Apparently writers on the subject of Sheppard’s adjustments 
have overlooked the case of adjustments about the mean, although the solution 
for the second moment is readily obtained as follows: 


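        $E(\mu_2) = E(\mu'_2 - M^2) = E(\mu'_2) - E(M^2)$

        $\qquad = \mu'_2 + \frac{1}{12}(k^2 - 1) - \frac{1}{k}\left(M_1^2 + M_2^2 + \cdots + M_k^2\right),$

where $M_i$ represents the mean of the i-th of the k different grouped distributions.
Since

        $\mu'_1 = \frac{1}{k}\left(M_1 + M_2 + \cdots + M_k\right),$

        $E(\mu_2) = \mu_2 + \frac{1}{12}(k^2 - 1) - \left[\frac{M_1^2 + M_2^2 + \cdots + M_k^2}{k} - \left(\frac{M_1 + M_2 + \cdots + M_k}{k}\right)^2\right].$

But since for any set of k variates

        $\sigma_v^2 = \frac{\Sigma v^2}{k} - \left(\frac{\Sigma v}{k}\right)^2,$

we have that

(10)   $E(\mu_2) = \mu_2 + \frac{1}{12}(k^2 - 1) - \sigma_M^2.$

Referring back to table 3 we find that

        $\sigma_M^2 = \frac{7856}{3\,(244^2)}$

and the numerical results now satisfy equation (10).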
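The numerical claims surrounding (9) and (10) can be verified exactly from the $\Sigma xf$ and $\Sigma x^2 f$ columns of table 3. In the sketch below those two columns are entered as read from the table, with the three damaged entries, for distributions (1), (3) and (4), restored so as to reproduce the corresponding bracketed terms of (9); that restoration and the variable names are assumptions of this illustration.

```python
from fractions import Fraction

N, k = 244, 9

# (sum of x f, sum of x^2 f) about M0 = 105 for the nine groupings of table 3.
table3 = [(-181, 77149), (-218, 78466), (-111, 77769), (-139, 79747),
          (-104, 81934), (-87, 80145), (-52, 80302), (-89, 78553), (-180, 78894)]

# Variance of each grouped distribution about its own mean, then the average (9).
variances = [Fraction(N * sx2 - sx * sx, N * N) for sx, sx2 in table3]
expected_variance = sum(variances) / k

# Exact variance of the ungrouped distribution of table 1.
mu2 = Fraction(N * 77591 - 129 ** 2, N * N)

# Variance of the nine grouped means about their own average.
means = [Fraction(sx, N) for sx, _ in table3]
var_of_means = sum(m * m for m in means) / k - (sum(means) / k) ** 2

# Right-hand side of (10).
rhs_10 = mu2 + Fraction(k * k - 1, 12) - var_of_means
print(expected_variance == rhs_10, float(var_of_means) ** 0.5)
```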

For the benefit of those interested in unsolved problems of mathematical 
statistics we may say that nothing appears to have been written as yet on the 
most important problem associated with the systematic errors due to grouping. 
It is of course desirable to eliminate these systematic errors introduced by 
grouping, but it is even more important to investigate the distribution of the 
accidental errors that remain after the systematic errors have been eliminated. 
For example it is gratifying to know that no systematic errors are present in the
$\Sigma xf$ column of table 3 and that equation (6) will enable us to add a constant to
each summation of the $\Sigma x^3 f$ column so that the mean of these adjusted values
will agree with the value $\Sigma x^3 f = -52005$ obtained in table 1. It is rather
disconcerting, however, to realize that in actual practice we may in the case of
discrete variates and must in the case of continuous variates select an arbitrary
set of class limits for our recorded data, and that after adjustments for grouping
have been made, our estimates of the true values of the moments of the
distribution will, as in table 3, depend so much upon the choice of these limits.
Thus, the standard error of the mean attributed to grouping is

        $\sigma_M = \frac{1}{244}\sqrt{\frac{7856}{3}} = 0.21,$

which is about twenty percent as large as the approximation for the standard
error of the mean due to sampling from an infinite parent population, namely,

        $\sigma_M = \frac{\sigma}{\sqrt{N}} = 1.15.$


If one will take the trouble to compute the values of $\mu_3$ and $\mu_4$ for each of the
distributions of table 2, utilizing the summations of table 3, and then compute
and compare the values of $\sigma_{\mu_3}$ and $\sigma_{\mu_4}$ due to grouping with the corresponding
functions associated with sampling, he will realize the seriousness of the situation.


SUMMARY 


The formula for Sheppard’s adjustments for distributions of grouped discrete 
variates was first given without proof in the Editorial of Vol. 1, No. 1 of the 
Annals (page 111). The method used to develop the general formula was 
extremely laborious and paralleled the method used for the case of continuous 
variates in the Handbook of Mathematical Statistics, Chapter 7, except that the 
calculus of finite differences was employed. A more satisfactory proof of this 
formula was presented by Dr. J. R. Abernethy in Vol. 4, No. 4 of the Annals 
in an article entitled ‘On the Elimination of Systematic Errors Due to Grouping.” 
An extremely elegant development of the same formula and an extension to the 
case of two variables appears elsewhere in this volume by Professor C. C. Craig. 
From the point of view of expectations, all of these developments are adjust- 
ments about a fixed point, although this fixed point may be selected arbitrarily 
at the mean of the distribution in question. The obtaining of formulae for the 
adjustments about the mean of each grouping and the distribution of the 
accidental errors that remain after these systematic errors have been removed 
has apparently been neglected to date and should interest students of mathe- 
matical statistics. 

From a mathematical standpoint, the development of this paper is the
simplest of all that have appeared to date: the adjustments for the first four
moments can be worked out with the aid of the binomial considerations leading
to formula (8) and the following well known formulae for the sums of the powers
of the first n integers:

        $S_1 = \frac{n(n+1)}{2}, \qquad S_3 = \frac{n^2(n+1)^2}{4},$

        $S_2 = \frac{n(n+1)(2n+1)}{6}, \qquad S_4 = \frac{n(n+1)(2n+1)(3n^2+3n-1)}{30}.$

One should note that the condition of high contact is not required in this
paper or in the developments of Abernethy or Craig. The results of the three
preceding papers agree with those obtained about a fixed point in this paper,
but fail to hold for the case of expectations about the mean, if we accept the
following definition:

        $E(\mu_s) = \frac{1}{k}\left(\mu_{s:1} + \mu_{s:2} + \cdots + \mu_{s:k}\right); \qquad (s = 2, 3, \cdots)$

where $\mu_{s:i}$ designates the s-th moment computed about the mean of the i-th
grouped distribution, $(1 \le i \le k)$.


H. C. Carver. 





